Methods of identifying mutations in nucleic acid

ABSTRACT

The present invention provides methods of identifying mutations in nucleic acid. Also provided herein are methods of identifying subjects having Hirschsprung disease risk and diagnostic markers for Hirschsprung disease.

RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 11/920,908 filed Nov. 6, 2009, which is a U.S. national phase application of PCT/US2006/020580, filed May 26, 2006, which claims the benefit of U.S. Provisional Application Nos. 60/684,686, filed May 26, 2005 and 60/684,903, filed May 26, 2005, the entire contents of which are expressly incorporated herein by reference.

GOVERNMENT SUPPORT

The following invention was supported at least in part by the NIH. Accordingly, the government may have certain rights in the invention.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Jan. 29, 2007, is named 65532.txt and is 6,057 bytes in size.

BACKGROUND

The identification of common variants that contribute to the genesis of human inherited disorders remains a significant challenge. For example, Hirschsprung disease (HSCR) is a multifactorial, non-Mendelian disorder in which rare high penetrance coding sequence mutations in the receptor tyrosine kinase RET contribute to risk in combination with mutations at other genes.

Hirschsprung disease (HSCR), or congenital aganglionosis with megacolon, occurs in 1 in 5,000 live births. Heritability of HSCR is nearly 100% with clear multigenic inheritance. While RET represents the major implicated HSCR gene^(1, 2), mutations also occur in seven other genes involved in enteric development, specifically ECE1, EDN3, EDNRB, GDNF, NRTN, SOX10, and ZFHX1B^(2). Less than 30% of patients, however, have mutations in these eight genes; thus, additional HSCR-causing mutations in RET and/or at other genes must exist.

Thus, there is a need in the art for methods of identifying variants that contribute to diseases, for example HSCR.

SUMMARY

Provided herein, in part by using a combination of human genetic, comparative genomic, functional, and population genetic analyses, are methods of identifying mutations in nucleic acid, and specifically methods of identifying subjects having Hirschsprung disease risk.

We have used family-based association studies to identify a disease interval, and integrated this with comparative and functional genomic analysis to prioritize conserved and functional elements within which mutations can be sought. We now show that a common, non-coding RET variant within a conserved enhancer-like sequence in intron 1 is significantly associated with HSCR susceptibility and makes a 20-fold greater contribution to risk than do rare alleles. This mutation reduces in vitro enhancer activity markedly, has low penetrance, has different genetic effects in males and females, and explains several features of the complex inheritance pattern of HSCR. Thus, common, low penetrance variants, identified by association studies, can underlie both common and rare diseases.

In one aspect, provided herein are methods of identifying a mutation in DNA, comprising predicting a genetic interval for a disease; comparing orthologous sequences to refine a putative functional interval; and sequencing the putative functional interval in subjects to identify mutations.

In one aspect, provided herein are methods of identifying a mutation in DNA, comprising predicting a genetic interval harboring mutations that contribute to disease susceptibility; comparing orthologous sequences to refine a putative functional interval; and sequencing the putative functional interval in subjects to identify mutations.

In one embodiment, the methods further comprise classifying the refined interval into one or more of coding, non-coding, functional and non-functional sequences.

In one related embodiment, the further comparing is after comparing orthologous sequences.

In one embodiment, the predicting comprises one or more of transmission disequilibrium tests (TDTs), linkage, or association studies.

In another embodiment, the subjects comprise individuals from affected families.

In one embodiment, the subjects comprise affected and unaffected individuals.

In another embodiment, mutations are over-represented in affected subjects as compared to normal subjects.

In another embodiment, the mutation is associated with a multigenic disease.

In one embodiment, the multigenic disease comprises one or more of mental illness, cancer, cardiovascular disease, congenital anomalies, metabolic disorder including but not limited to diabetes, susceptibility to infection, drug response, or drug tolerance.

In one embodiment, the mutation comprises a variant of RET.

In one related embodiment, the RET variant comprises RET+3:T.

In another embodiment, the mutations are one or more of associated with a disease susceptibility, causative of disease, or contributory to disease.

In one embodiment, the mutation comprises a single nucleotide polymorphism, a multi-nucleotide polymorphism, an insertion, a deletion, a repeat expansion, genomic rearrangements, or segmental amplification.

In another embodiment, the orthologous sequences comprise vertebrate sequences.

In one embodiment, the vertebrate sequences comprise mammalian, reptilian, avian, amphibian, or osteichthyes sequences.

In one embodiment, at least two orthologous sequences are compared to refine the interval.

In one embodiment, the interval is refined by at least 20 fold.

In one related embodiment, the interval is refined by about 10 fold.

In another related embodiment, the interval is refined by about 5 fold.

In one aspect, provided herein are methods of identifying a diagnostic marker for a disease, comprising predicting a genetic interval for a disease; comparing orthologous sequences to refine the interval; and sequencing the refined interval in affected and unaffected subjects to thereby identify a diagnostic marker associated with disease susceptibility, wherein the marker is over-represented in affected subjects compared to unaffected subjects.

In one embodiment, the methods further comprise classifying the refined interval into one or more of coding, non-coding, functional and non-functional sequences.

In one embodiment, the further comparing is after comparing orthologous sequences.

In another embodiment, the predicting comprises one or more of transmission disequilibrium tests (TDTs), linkage, or association studies.

In one embodiment, the subjects comprise affected and unaffected individuals.

In another embodiment, mutations are over-represented in affected subjects as compared to normal subjects.

In one embodiment, the mutation is associated with a multigenic disease.

In another embodiment, the multigenic disease comprises one or more of mental illness, cancer, cardiovascular disease, congenital anomalies, metabolic disorder including but not limited to diabetes, susceptibility to infection, drug response, or drug tolerance.

In another embodiment, the mutations are one or more of associated with a disease susceptibility, causative of disease, or contributory to disease.

In one embodiment, the mutation comprises a single nucleotide polymorphism, a multi-nucleotide polymorphism, an insertion, a deletion, a repeat expansion, genomic rearrangements, or segmental amplification.

In one embodiment, the orthologous sequences comprise vertebrate sequences.

In another embodiment, the vertebrate sequences comprise mammalian, reptilian, avian, amphibian, or osteichthyes sequences.

In one embodiment, at least two orthologous sequences are compared to refine the interval.

In one related embodiment, the interval is refined by at least 20 fold.

In another related embodiment, the interval is refined by about 10 fold.

In yet another related embodiment, the interval is refined by about 5 fold.

In one embodiment, the methods may further comprise characterizing the marker. In one embodiment, characterizing comprises one or more of expression analysis, promoter analysis, regulatory element analysis, knock-out analysis, or knock-down analysis. Methods of analysis are well known to one of skill in the art. In a related embodiment, one or more of the analyses are done with a transgenic animal or a cell line.

According to one aspect, provided herein are methods of identifying a subject having Hirschsprung disease risk comprising detecting in the subject a mutation in the receptor tyrosine kinase RET, wherein the RET+3:T allele is associated with disease risk.

In one embodiment, RET is a marker for segmental forms of HSCR.

In one embodiment, the subject is a member of an affected family.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A-FIG. 1B depict transmission disequilibrium tests (TDT). FIG. 1A shows TDT tests of individual SNPs in the region of 10q11.21 including RET, GALNACT-2, and RASGEF1A. Horizontal line at 50% transmission indicates expectation under the null hypothesis. The * identifies RET+3. Exons are marked by coloured boxes. Black rectangle represents the 27-kb area displayed in FIG. 3A. FIG. 1B shows exhaustive Allelic TDT (EATDT). The most 5′ SNP shown is RET-5, the most 3′ SNP is X2EagI. Counts of transmitted and untransmitted chromosomes are given in columns to the right. All haplotypes with permutation-based p values less than or equal to the single most significantly associated SNP (RET+3) are shown.

FIG. 2A-FIG. 2D depict the identification and characterization of conserved sequence elements within 350 kb encompassing RET. FIG. 2A shows a multi-PIP alignment of genomic sequence from 12 vertebrates compared to the human. Red: greater than 75% sequence identity over 100 nucleotides; green: greater than 50% sequence identity over 100 nucleotides; blue: gaps in contig of 500 nucleotides or more. FIG. 2B shows northern blots showing expression of GALNACT-2 (GN2) and RASGEF1A (RG1A) in adult mouse tissues. FIG. 2C and FIG. 2D show expression of RET, GALNACT-2 (GN2) and RASGEF1A (RG1A) by RT-PCR in embryonic mouse (FIG. 2C) and adult human tissues (FIG. 2D).

FIG. 3A depicts a VISTA plot displaying percent identity between mouse and human in the 5′ region of RET. Estimated transmission frequencies to affected offspring are shown by red circles. FIG. 3B shows reporter gene expression in Neuro-2a cells using amplicons MCS+9.7 and MCS+5.1/9.7 (mutant and wild type correspond to nucleotides T and C, respectively). The smaller of the tested constructs (MCS+9.7 only) is bracketed in red. The MCS+5.1/9.7 amplicon encompassing both MCS+9.7 and the adjacent MCS+5.1 is bracketed in green. All assays were conducted in triplicate and were repeated three times (9 data points total); error bars represent standard error.

FIG. 4 depicts worldwide allele frequencies of RET+3. Frequencies of the putative wild type (green, C) and mutant (yellow, T) alleles are given for 51 populations comprising 1,064 individuals from the CEPH Human Genome Diversity Panel.

FIG. 5 depicts a nucleotide alignment of multiple mammalian sequences showing the complete sequence of MCS+9.7. Additional sequence flanking the MCS is shown in lower-case, gray lettering. Position of the functional SNP RET+3 is highlighted in red.

DETAILED DESCRIPTION

Provided herein are methods relating to identifying diagnostic markers, identifying mutations in DNA and identifying subjects having Hirschsprung disease risk. In particular, we have shown that methods comprising comparing an identified genetic interval to orthologous sequences refine the interval.

In part, the invention is based on the use of family-based association studies to identify a disease interval, and the integration of this with comparative and functional genomic analysis to prioritize conserved and functional elements within which mutations can be sought. For example, a common, non-coding RET variant within a conserved enhancer-like sequence in intron 1 is significantly associated with HSCR susceptibility and makes a 20-fold greater contribution to risk than do rare alleles. This mutation reduces in vitro enhancer activity markedly, has low penetrance, has different genetic effects in males and females, and explains several features of the complex inheritance pattern of HSCR. Thus, common, low penetrance variants, identified by association studies, can underlie both common and rare diseases.

“Mutation,” as used herein, refers, for example, to a polymorphism or marker that occurs in those at risk of developing a disease, is associated with a disease or is causative of a disease. In certain instances, the mutation may be strongly correlated with the presence of a particular disorder (e.g., the presence of such mutation indicating a high risk of the subject being afflicted with a disease). However, “mutation” as used herein can also refer to a specific site and type of polymorphism or marker, without reference to the degree of risk that particular mutation poses to an individual for a particular disease. Mutations, as used herein, are over-represented in affected subjects as compared to normal subjects and may be associated with a multigenic disease. The multigenic disease may comprise, for example, one or more of mental illness, cancer, cardiovascular disease, congenital anomalies, metabolic disorder including but not limited to diabetes, susceptibility to infection, drug response, or drug tolerance. The mutation may comprise a variant of RET, for example, the RET variant RET+3:T. Mutations may be one or more of associated with a disease susceptibility, causative of disease, or contributory to disease and the like. Mutations, as used herein, may comprise a single nucleotide polymorphism, a multi-nucleotide polymorphism, an insertion, a deletion, a repeat expansion, genomic rearrangements, or segmental amplification.

“Linked,” as used herein, refers, for example, to a region of a chromosome shared more frequently in family members affected by a particular disease than would be expected by chance, thereby indicating that the gene or genes within the linked chromosome region contain or are associated with a marker or polymorphism that is correlated to the presence of, or risk of, disease. Once linkage is established, association studies (linkage disequilibrium), for example, can be used to narrow the region of interest or to identify the risk-conferring gene associated with a disease.

“Associated with,” when used to refer, for example, to a marker or polymorphism and a particular gene, means that the polymorphism or marker is either within the indicated gene, or in a different physically adjacent gene on that chromosome. In general, such a physically adjacent gene is on the same chromosome and within 2, 3, 5, 10 or 15 centimorgans of the named gene (i.e., within about 1 or 2 million base pairs of the named gene). The adjacent gene may span over 5, 10 or even 15 megabases. Polymorphisms may be functional polymorphisms. “Associated with,” in reference to a mutation being associated with a disease, refers to, for example, a statistical association.

A “centimorgan” as used herein refers to a unit of measure of recombination frequency. One centimorgan is equal to a 1% chance that a marker at one genetic locus will be separated from a marker at a second locus due to crossing over in a single generation. In humans, one centimorgan is equivalent, on average, to one million base pairs.
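The conversion implied by this definition can be sketched in a few lines. This is only an illustration under the averages stated above; the function names are hypothetical, and the linear recombination estimate is an approximation that holds only for short genetic distances.

```python
# Human genome-wide average stated in the text: 1 cM is about one million base pairs.
BASE_PAIRS_PER_CENTIMORGAN = 1_000_000


def centimorgans_to_base_pairs(centimorgans: float) -> float:
    """Approximate physical distance (bp) for a genetic distance in cM."""
    return centimorgans * BASE_PAIRS_PER_CENTIMORGAN


def recombination_chance(centimorgans: float) -> float:
    """Approximate per-generation chance that two markers are separated.

    Uses 1 cM = 1% directly; this linear rule is an assumption that is
    reasonable only for small distances.
    """
    return centimorgans / 100.0


# A 15 cM linkage region corresponds to roughly 15 megabases on average.
print(centimorgans_to_base_pairs(15))
print(recombination_chance(2))
```

Because the one-million-base-pair figure is an average, actual physical distances for a given genetic distance vary along the genome.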

Markers and polymorphisms of this invention (e.g., genetic markers such as single nucleotide polymorphisms, restriction fragment length polymorphisms and simple sequence length polymorphisms) can be detected directly or indirectly. A marker can, for example, be detected indirectly by detecting or screening for another marker that is tightly linked to that marker (e.g., is located within 2 or 3 centimorgans of it). Additionally, the adjacent gene can be found within an approximately 15 cM linkage region surrounding the chromosome, thus spanning over 5, 10 or even 15 megabases.

The presence of a marker or polymorphism associated with a gene linked to a disease, for example Hirschsprung disease, indicates that the subject is afflicted with the disease and/or is at risk of developing the disease. A subject who is “at increased risk of developing a disease” is one who is predisposed to the disease, has genetic susceptibility for the disease and/or is more likely to develop the disease than subjects in which the detected polymorphism is absent. A subject who is “at increased risk of developing a disease at an early age” is one who is predisposed to the disease, has genetic susceptibility for the disease and/or is more likely to develop the disease at an age that is earlier than the age of onset in subjects in which the detected polymorphism is absent. Thus, the marker or polymorphism can also indicate “age of onset” of a disease. The methods described herein can be employed to screen for any type of disease, including, for example, multigenic diseases, mental illness, cancer, cardiovascular disease, congenital anomalies, metabolic disorder including but not limited to diabetes, susceptibility to infection, drug response, or drug tolerance, and the like.

Subjects include, for example, mammals and specifically human subjects, including male and female subjects of any age or race. Suitable subjects include, but are not limited to, those who have not previously been diagnosed with a disease, those who have previously been determined to be at risk of developing a disease and/or at risk of developing a disease at an early age, and those who have been initially diagnosed with a disease or who are suspected of having a disease where confirming and/or prognostic information is desired. Thus, it is contemplated that the methods described herein can be used in conjunction with other clinical diagnostic information known or described in the art used in the evaluation of subjects with a disease or suspected to be at risk for developing such disease. Subjects may also comprise individuals from affected families and individuals from unaffected families.

The present invention discloses methods of screening a subject for Hirschsprung disease. The method comprises the steps of: detecting the presence or absence of a marker for Hirschsprung disease, and/or a polymorphism associated with a gene linked to Hirschsprung disease, with the presence of such a marker or polymorphism indicating that the subject has the disease, and/or is at increased risk of developing Hirschsprung disease.

The detecting step can include determining whether the subject is heterozygous or homozygous for the marker and/or polymorphism, with subjects who are at least heterozygous for the polymorphism or marker being at increased risk for a disease. The step of detecting the presence or absence of the marker or polymorphism can include the step of detecting the presence or absence of the marker or polymorphism in both chromosomes of the subject (i.e., detecting the presence or absence of one or two alleles containing the marker or polymorphism). More than one copy of a marker or polymorphism (i.e., subjects homozygous for the polymorphism) can indicate a greater risk of developing a disease.
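The zygosity determination described above amounts to counting copies of the marker allele across the two chromosomes. The following is a minimal sketch, assuming genotypes are already available as a pair of single-letter allele calls; the function names and the default marker allele "T" (borrowed from the RET+3:T variant named elsewhere in this document) are illustrative assumptions, not a prescribed assay.

```python
def classify_zygosity(allele_a: str, allele_b: str, marker_allele: str = "T") -> str:
    """Classify a diploid genotype by the number of marker-allele copies."""
    copies = [allele_a, allele_b].count(marker_allele)
    if copies == 2:
        return "homozygous"
    if copies == 1:
        return "heterozygous"
    return "absent"


def carries_marker(allele_a: str, allele_b: str, marker_allele: str = "T") -> bool:
    """True when the subject is at least heterozygous for the marker allele."""
    return classify_zygosity(allele_a, allele_b, marker_allele) != "absent"


# Hypothetical genotype calls at a biallelic C/T site.
for genotype in [("C", "C"), ("C", "T"), ("T", "T")]:
    print(genotype, classify_zygosity(*genotype), carries_marker(*genotype))
```

The sketch only reports carrier status and copy number; how much risk each copy confers is a separate question that the classification does not answer.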

The detecting step can be carried out in accordance with known techniques (See, e.g., U.S. Pat. Nos. 6,027,896 and 5,508,167 to Roses et al.), such as by collecting a biological sample containing nucleic acid (e.g., DNA) from the subject, and then determining the presence or absence of nucleic acid encoding or indicative of the polymorphism or marker in the biological sample. Any biological sample that contains the nucleic acid of that subject can be employed, including tissue samples and blood samples, with blood cells being a particularly convenient source.

Determining the presence or absence of a particular polymorphism or marker can be carried out, for example, with an oligonucleotide probe labeled with a suitable detectable group, and/or by means of an amplification reaction (e.g., with oligonucleotide primers) such as a polymerase chain reaction (PCR) or ligase chain reaction (the product of which amplification reaction can then be detected with a labeled oligonucleotide probe or a number of other techniques). Further, the detecting step can include the step of determining whether the subject is heterozygous or homozygous for the particular polymorphism or marker, as described herein. Numerous different oligonucleotide probe assay formats are known which can be employed to carry out the present invention. See, e.g., U.S. Pat. No. 4,302,204 to Wahl et al.; U.S. Pat. No. 4,358,535 to Falkow et al.; U.S. Pat. No. 4,563,419 to Ranki et al.; and U.S. Pat. No. 4,994,373 to Stavrianopoulos et al. (the entire contents of each of which are incorporated herein by reference). The oligonucleotides can be used to hybridize to the nucleic acids of this invention. In some embodiments, the oligonucleotides can be from 2 to 100 nucleotides and in other embodiments, the oligonucleotides can be 5, 10, 12, 15, 18, 20, 25, 30, 35, 40, 45 or 50 bases, including any value between 5 and 50 not specifically recited herein (e.g., 16 bases; 34 bases). Determining the presence or absence of a particular polymorphism may also be carried out by sequencing the relevant nucleic acid.

Amplification of a selected, or target, nucleic acid sequence can be carried out by any suitable means. See generally, Kwoh et al., Am. Biotechnol. Lab. 8, 14-25 (1990). Examples of suitable amplification techniques include, but are not limited to, polymerase chain reaction, ligase chain reaction, strand displacement amplification (see generally G. Walker et al., Proc. Natl. Acad. Sci. USA 89, 392-396 (1992); G. Walker et al., Nucleic Acids Res. 20, 1691-1696 (1992)), transcription-based amplification (see D. Kwoh et al., Proc. Natl. Acad. Sci. USA 86, 1173-1177 (1989)), self-sustained sequence replication (or “3SR”) (see J. Guatelli et al., Proc. Natl. Acad. Sci. USA 87, 1874-1878 (1990)), the Qβ replicase system (see P. Lizardi et al., BioTechnology 6, 1197-1202 (1988)), nucleic acid sequence-based amplification (or “NASBA”) (see R. Lewis, Genetic Engineering News 12 (9), 1 (1992)), the repair chain reaction (or “RCR”) (see R. Lewis, supra), and boomerang DNA amplification (or “BDA”) (see R. Lewis, supra).

As used here, “predicting a genetic interval for a disease,” refers to, for example, identifying an interval associated with a disease using, for example, one or more genetic tests, e.g., transmission disequilibrium tests (TDTs), linkage, or association studies.

As used here, “comparing orthologous sequences to refine a putative functional interval,” refers to, for example, the comparison of at least one orthologous sequence to the interval. The orthologous sequence refines the interval by, for example, revealing the evolutionarily conserved regions of the interval that are more likely to be under selective pressure. Thus, differences or mutations found in these regions are more likely to be associated with disease. One or more orthologous sequences may be compared to the interval for further refining. The comparing can be done by software, hardware or by an individual, for example by methods described infra in the Examples. Orthologous sequences comprise, for example, vertebrate sequences. Orthologous sequences may also be from single-celled organisms, e.g., yeast, bacteria, viruses, and the like. Vertebrate sequences comprise, for example, mammalian, reptilian, avian, amphibian, or osteichthyes sequences, and the like.
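One simple way software could flag conserved regions is a sliding-window percent-identity scan over a pairwise alignment. The sketch below is an illustration only and is not the multi-PIP or VISTA procedure used for the figures: it assumes the two sequences are already aligned to equal length with "-" for gaps, and it borrows the 100-nucleotide window and 75% identity threshold from the FIG. 2A legend as defaults. All function names are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned columns with identical, non-gap bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        return 0.0
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y and x != "-")
    return 100.0 * matches / len(seq_a)


def conserved_window_starts(reference: str, ortholog: str,
                            window: int = 100, threshold: float = 75.0) -> list:
    """Start offsets of windows whose identity meets or exceeds the threshold."""
    starts = []
    for start in range(len(reference) - window + 1):
        identity = percent_identity(reference[start:start + window],
                                    ortholog[start:start + window])
        if identity >= threshold:
            starts.append(start)
    return starts


# Toy alignment with a short window so the behaviour is easy to follow.
human = "ACGTACGTAC"
mouse = "ACGTTTTTAC"
print(conserved_window_starts(human, mouse, window=4, threshold=75.0))
```

Windows that pass the threshold mark candidate conserved elements; sequencing effort can then be concentrated on those windows rather than on the whole interval.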

As used here, “a putative functional interval,” refers to, for example, an interval shown to be associated with a disease by, for example, genetic studies, including transmission disequilibrium tests (TDTs), linkage, or association studies. These methods are useful in predicting the interval.

Sequencing the putative functional interval in subjects to identify mutations can be by any known or future-developed sequencing method.

In one embodiment, further comparing is after comparing orthologous sequences.

In one embodiment, one orthologous sequence is compared to refine the interval. In another embodiment, at least two orthologous sequences are compared to refine the interval. In one embodiment, the interval is refined by the comparison to one or more orthologous sequences by at least about 50 fold, at least about 40 fold, at least about 30 fold, at least about 25 fold, at least about 20 fold, at least about 15 fold, at least about 10 fold, or at least about 5 fold.

“Classifying the refined interval,” as used herein, refers to, for example, defining the function or type of sequence that makes up the interval. The classifications include, for example, one or more of coding, non-coding, functional and non-functional sequences. Non-coding sequences may also be classified as functional sequences.

Methods of predicting an interval comprise, for example, multi-analytical approaches including both parametric lod score and non-parametric affected relative pair methods. Maximized parametric lod scores (MLOD) for each marker may be calculated, for example, by using VITESSE and HOMOG program packages (O'Connell & Weeks, Nat. Genet. 11:402 (1995); Ott, Analysis of Human Genetic Linkage (The Johns Hopkins University Press, Baltimore, Ed. 3, 1999)). The MLOD is the lod score maximized over the two genetic models tested, allowing for genetic heterogeneity. Dominant and recessive low-penetrance (affecteds-only) models may be considered. Methods may be further based on prevalence estimates and, for example, age-dependent or incomplete penetrance. Disease allele frequencies of 0.001 for the dominant model and 0.20 for the recessive model may be used. Marker allele frequencies may be generated, for example, from related or unrelated individuals. Multipoint non-parametric lod scores (LOD*) may be calculated, for example, using GENEHUNTER-PLUS software (Kong & Cox, Am. J. Hum. Genet. 61:1179 (1997)) and sex-averaged intermarker distances. In contrast to non-parametric linkage approaches which consider allele sharing in pairs of affected siblings [Risch, Am. J. Hum. Genet. 46:222 (1990)], GENEHUNTER-PLUS considers allele sharing across pairs of affected relatives (or all affected relatives in a family) in moderately sized pedigrees.
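For orientation, the quantity underlying a parametric lod score can be illustrated with the simplest textbook case. The sketch below assumes fully informative, phase-known meioses, which is a far simpler setting than the pedigree likelihoods computed by the packages named above; it does not reproduce VITESSE, HOMOG or GENEHUNTER-PLUS, and the function names are hypothetical.

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """log10 likelihood ratio of linkage at theta versus no linkage (theta = 0.5).

    Assumes phase-known, fully informative meioses (an illustrative
    simplification).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    non_recombinants = meioses - recombinants
    likelihood_linked = (theta ** recombinants) * ((1.0 - theta) ** non_recombinants)
    likelihood_unlinked = 0.5 ** meioses
    if likelihood_linked == 0.0:
        return float("-inf")
    return math.log10(likelihood_linked / likelihood_unlinked)


def maximized_lod(recombinants: int, meioses: int, steps: int = 50) -> float:
    """Maximize the lod score over a grid of theta values from 0 to 0.5."""
    thetas = [0.5 * i / steps for i in range(steps + 1)]
    return max(two_point_lod(recombinants, meioses, t) for t in thetas)


# One recombinant in ten hypothetical meioses.
print(round(two_point_lod(1, 10, 0.1), 3))
print(round(maximized_lod(1, 10), 3))
```

Maximizing over a grid of recombination fractions mirrors, in miniature, the idea of reporting the lod score maximized over the models tested.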

Depending upon the disease being studied and due to the potential genetic heterogeneity in the sample, samples may be stratified, for example by age of onset.

In one embodiment, an initial complete genomic screen is used to identify regions of the genome likely harboring susceptibility loci for more thorough analysis. Because genetic heterogeneity likely reduces the power to detect statistically significant evidence of linkage using the traditional criterion, lod scores of from about 3 to about 1 may be used in the overall sample for consideration of a region as interesting and warranting initial follow-up. Regions may be prioritized into two groups: regions generating lod scores>1 on both two-point and multipoint analyses, and regions with lod scores>1. While this approach may increase the number of false-positive results that are examined in more detail, it decreases the more serious (in this case) false-negative rate.

As used herein, the term “non-human animal” refers to any non-human vertebrate, birds and more usually mammals, preferably primates, farm animals such as swine, goats, sheep, donkeys, and horses, rabbits or rodents, more preferably rats or mice. As used herein, the term “animal” is used to refer to any vertebrate, preferably a mammal. Both the terms “animal” and “mammal” expressly embrace human subjects unless preceded with the term “non-human”.

The term “primer” denotes a specific oligonucleotide sequence which is complementary to a target nucleotide sequence and used to hybridize to the target nucleotide sequence. A primer serves as an initiation point for nucleotide polymerization catalyzed by either DNA polymerase, RNA polymerase or reverse transcriptase.

The term “probe” denotes a defined nucleic acid segment (or nucleotide analog segment, e.g., polynucleotide as defined herein) which can be used to identify a specific polynucleotide sequence present in samples, said nucleic acid segment comprising a nucleotide sequence complementary to the specific polynucleotide sequence to be identified.

The terms “trait” and “phenotype” are used interchangeably herein and refer to any visible, detectable or otherwise measurable property of an organism such as symptoms of, or susceptibility to, a disease, for example. Typically the terms “trait” or “phenotype” are used herein to refer to symptoms of, or susceptibility to, a disease; or to refer to an individual's response to a drug; or to refer to symptoms of, or susceptibility to, side effects to a drug. In addition, the terms “trait” or “phenotype” may be used herein to refer to symptoms of, or susceptibility to, a disease involving arachidonic acid metabolism; or to refer to an individual's response to an agent acting on arachidonic acid metabolism; or to refer to symptoms of, or susceptibility to, side effects to an agent acting on arachidonic acid metabolism.

The term “allele” is used herein to refer to variants of a nucleotide sequence. A biallelic polymorphism has two forms. Typically the first identified allele is designated as the original allele whereas other alleles are designated as alternative alleles. Diploid organisms may be homozygous or heterozygous for an allelic form.

The term “genotype” as used herein refers to the identity of the alleles present in an individual or a sample. In the context of the present invention a genotype preferably refers to the description of the biallelic marker alleles present in an individual or a sample. The term “genotyping” a sample or an individual for a biallelic marker consists of determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker.

The term “haplotype” refers to one or more alleles present on the same chromosome in an individual or a sample. In the context of the present invention a haplotype preferably refers to a combination of biallelic marker alleles found in a given individual and which may be associated with a phenotype.

The term “polymorphism” as used herein refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. “Polymorphic” refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A “polymorphic site” is the locus at which the variation occurs. A single nucleotide polymorphism is a single base pair change. Typically a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms. In the context of the present invention “single nucleotide polymorphism” preferably refers to a single nucleotide substitution. Typically, between different genomes or between different individuals, the polymorphic site may be occupied by two different nucleotides.

The terms “biallelic polymorphism” and “biallelic marker” are used interchangeably herein to refer to a polymorphism having two alleles at a fairly high frequency in the population, preferably a single nucleotide polymorphism. A “biallelic marker allele” refers to the nucleotide variants present at a biallelic marker site. Typically the frequency of the less common allele of the biallelic markers of the present invention has been validated to be greater than 1%, preferably the frequency is greater than 10%, more preferably the frequency is at least 20% (i.e. heterozygosity rate of at least 0.32), even more preferably the frequency is at least 30% (i.e. heterozygosity rate of at least 0.42). A biallelic marker wherein the frequency of the less common allele is 30% or more is termed a “high quality biallelic marker.”
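The heterozygosity rates quoted above are consistent with the standard Hardy-Weinberg expectation for a biallelic site, 2p(1 - p), where p is the frequency of the less common allele. The short sketch below reproduces that arithmetic; the function names are hypothetical, and random mating (Hardy-Weinberg equilibrium) is an assumption of the calculation rather than something stated in the text.

```python
def expected_heterozygosity(minor_allele_frequency: float) -> float:
    """Expected heterozygosity 2p(1 - p) at a biallelic site (Hardy-Weinberg assumed)."""
    if not 0.0 <= minor_allele_frequency <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    p = minor_allele_frequency
    return 2.0 * p * (1.0 - p)


def is_high_quality_biallelic_marker(minor_allele_frequency: float) -> bool:
    """Apply the 30% minor-allele-frequency cutoff given in the definition."""
    return minor_allele_frequency >= 0.30


for frequency in (0.01, 0.10, 0.20, 0.30):
    print(frequency,
          round(expected_heterozygosity(frequency), 2),
          is_high_quality_biallelic_marker(frequency))
```

At p = 0.20 the expression gives 2 × 0.20 × 0.80 = 0.32, and at p = 0.30 it gives 2 × 0.30 × 0.70 = 0.42, matching the rates in the definition.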

The term “upstream” is used herein to refer to a location which is toward the 5′ end of the polynucleotide from a specific reference point.

The terms “base paired” and “Watson & Crick base paired” are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA, with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4th edition, 1995).

The terms “complementary” or “complement thereof” are used herein to refer to the sequences of polynucleotides which are capable of forming Watson & Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. This term is applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind.
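Because complementarity as defined here depends on sequence alone, it can be checked purely at the string level. The sketch below is illustrative only. The function names are hypothetical, and treating uracil as equivalent to thymine is a simplifying assumption.

```python
# Watson & Crick partners; U is mapped to A so RNA input is tolerated.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence as DNA letters."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_complementary(first: str, second: str) -> bool:
    """True if `second` pairs with `first` over its entire length.

    Both strands are written 5' to 3'; only sequence is considered,
    not hybridization conditions.
    """
    def normalise(s: str) -> str:
        return s.upper().replace("U", "T")

    return normalise(second) == reverse_complement(normalise(first))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(is_complementary("ATGC", "GCAT"))
```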

A “promoter” refers to a DNA sequence recognized by the synthetic machinery of the cell required to initiate the specific transcription of a gene.

A sequence which is “operably linked” to a regulatory sequence such as a promoter means that said regulatory element is in the correct location and orientation in relation to the nucleic acid to control RNA polymerase initiation and expression of the nucleic acid of interest.

As used herein, the term “operably linked” refers to a linkage of polynucleotide elements in a functional relationship. For instance, a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the coding sequence. More precisely, two DNA molecules (such as a polynucleotide containing a promoter region and a polynucleotide encoding a desired polypeptide or polynucleotide) are said to be “operably linked” if the nature of the linkage between the two polynucleotides does not (1) result in the introduction of a frame-shift mutation or (2) interfere with the ability of the polynucleotide containing the promoter to direct the transcription of the coding polynucleotide.

The TDT (Spielman et al. (1993) Am J Hum Genet 52: 506-16) is a test for both association and linkage; more specifically, it tests for linkage in the presence of association. Thus, if association does not exist at the locus of interest, linkage will not be detected even if it exists. It is for this reason that the test has been included in this section. It may be used as an initial test, but is more commonly used when tentative evidence for association has already been identified. In this case, a positive result will not only confirm the initial association, but also provide evidence for linkage.
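For a biallelic marker, the TDT is commonly computed as a McNemar-type statistic on the transmissions from heterozygous parents. The sketch below assumes that standard form, (b − c)²/(b + c) with one degree of freedom. The function names and the counts are hypothetical and are not data from this disclosure.

```python
import math


def tdt_chi_square(b: int, c: int) -> float:
    """Biallelic TDT statistic.

    b: heterozygous parents transmitting allele 1 to an affected child
    c: heterozygous parents transmitting allele 2 to an affected child
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    return (b - c) ** 2 / (b + c)


def chi_square_1df_p_value(statistic: float) -> float:
    """Upper-tail P value of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    stat = tdt_chi_square(30, 10)
    print(stat, chi_square_1df_p_value(stat))
```

Only heterozygous parents contribute, which is why the statistic is uninformative when no association exists at the marker.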

Multi-allele Transmission Disequilibrium Test (TDT). TDT is a widely used method for family-based genetic study (Spielman et al., Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM), Am. J. Hum. Genet., 1993 March; 52(3):506-16), where parents and children in a family are typed. Testing for linkage in the presence of linkage disequilibrium (association), TDT can be very powerful for identifying a susceptibility locus, especially when the effect is small, as is often the case with complex genetic traits. Although the original TDT test was developed to analyze biallelic markers, new statistics have been developed to accommodate the availability of multiallelic markers or haplotypes (Spielman et al., The TDT and other family-based tests for linkage disequilibrium and association, Am. J. Hum. Genet., 1996 November; 59(5):983-9; Curtis and Sham, Model-free linkage analysis using likelihoods, Am. J. Hum. Genet., 1995 September; 57(3):703-16; Bickeboller et al., Statistical properties of the allelic and genotypic transmission/disequilibrium test for multiallelic markers, Genet. Epidemiol., 1995; 12(6):865-70). Based on the survey of those methods performed by Kaplan (Kaplan et al., Power studies for the transmission/disequilibrium tests with multiple alleles, Am. J. Hum. Genet., 1997 March; 60(3):691-702), we have chosen the marginal statistic with only heterozygous parents (T_mhet) by Spielman and Ewens (Spielman et al., The TDT and other family-based tests for linkage disequilibrium and association, Am. J. Hum. Genet., 1996 November; 59(5):983-9), because it has equivalent power to the other multi-allelic tests and gives a valid chi-square test of linkage. Multi-allele TDT can be readily applied to patterns because of the multi-allele or multi-genotype nature of a pattern. In a TDT test on a pattern, each observed permutation of a pattern is treated as a column and row heading in a TDT contingency table. The corresponding chi-square value is calculated as described (Spielman et al., The TDT and other family-based tests for linkage disequilibrium and association, Am. J. Hum. Genet., 1996 November; 59(5):983-9) and a P value is assigned according to the default distribution or a reference distribution simulated by Monte Carlo. This statistic can only be applied to patterns identified in a family-based association study design.
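As a rough illustration of the marginal statistic, the sketch below assumes the commonly cited form T_mhet = ((k − 1)/k) Σ (n_i. − n_.i)² / (n_i. + n_.i − 2n_ii), where n_ij counts parents transmitting allele (or pattern permutation) i and not transmitting j. This form should be checked against the cited Spielman and Ewens paper before use. The function name and the tables are hypothetical.

```python
from typing import Sequence


def t_mhet(table: Sequence[Sequence[int]]) -> float:
    """Marginal multi-allele TDT statistic from a k x k transmission table.

    table[i][j] = number of parents transmitting allele i and not
    transmitting allele j. Diagonal (homozygous) entries are removed by
    the -2 * n_ii term, so only heterozygous parents contribute.
    Under the null it is referred to a chi-square with k - 1 degrees of
    freedom, or to a simulated reference distribution.
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least two alleles")
    total = 0.0
    for i in range(k):
        transmitted = sum(table[i])                        # n_i.
        untransmitted = sum(table[r][i] for r in range(k))  # n_.i
        denom = transmitted + untransmitted - 2 * table[i][i]
        if denom > 0:  # alleles never seen in heterozygotes add nothing
            total += (transmitted - untransmitted) ** 2 / denom
    return (k - 1) / k * total


if __name__ == "__main__":
    # Hypothetical two-allele table; reduces to the biallelic TDT value.
    print(t_mhet([[0, 30], [10, 0]]))
```

With two alleles the expression collapses to (b − c)²/(b + c), which gives a quick sanity check on the implementation.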

The Quantitative Transmission Disequilibrium Test (QTDT) analysis proposed by George et al. [1999] was used. This test detects linkage in the presence of association. The maximum likelihood estimates of the parameters and the standard errors of the estimates are computed by numerical methods. These procedures are implemented in the program ASSOC of the S.A.G.E. [1998] software package.

Single permutation tests have been used in mapping studies before (Churchill and Doerge 1994, Laitinen et al. 1997, Long and Langley 1999). However, if more complex data are to be analyzed, these single permutation tests are too expensive, computationally very inefficient, and may even be inoperative.

Haplotype-based Haplotype Relative Risk (HHRR). The HHRR test is another method for family-based studies (Terwilliger et al., A haplotype-based ‘haplotype relative risk’ approach to detecting allelic associations, Hum. Hered., 1992; 42(6):337-46). It is a variation of the Haplotype Relative Risk (HRR) method, which is genotype-based. In Rubinstein's Genotype-based haplotype relative risk (GHRR) method, the affected children's genotypes at a marker locus are used as cases and artificial genotypes made up of the alleles not transmitted to the children from their parents are used as controls. For each haplotype of interest, a 2×2 contingency table is constructed and used to record the number of cases and controls with or without that haplotype. In contrast, HHRR utilizes haplotypes rather than genotypes. In particular, transmitted chromosomes are treated as cases and untransmitted chromosomes are used as controls. A 2×2 table is constructed the same as for GHRR. HHRR can be extended to be applied to patterns because of the similarity between a pattern and a multi-marker haplotype. In an HHRR test for a pattern, the observed counts for the pattern in cases and in controls and the observed counts for all other permutations on markers in that pattern in cases and controls are recorded in the 2×2 contingency table. Upon the calculation of chi-square values, P values are assigned according to the default distribution or a reference distribution simulated by Monte Carlo.
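The 2×2 construction described above can be sketched as follows. The code assumes the ordinary Pearson chi-square without continuity correction, which the text does not specify. The function names and the haplotype counts are hypothetical.

```python
from typing import Sequence, Tuple


def hhrr_table(transmitted: Sequence[str],
               untransmitted: Sequence[str],
               haplotype: str) -> Tuple[int, int, int, int]:
    """Build the HHRR 2x2 table for one haplotype (or pattern).

    Transmitted chromosomes are the cases, untransmitted chromosomes the
    controls. Returns (a, b, c, d): cases with / without the haplotype,
    then controls with / without it.
    """
    a = sum(h == haplotype for h in transmitted)
    b = len(transmitted) - a
    c = sum(h == haplotype for h in untransmitted)
    d = len(untransmitted) - c
    return a, b, c, d


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (1 df, no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical chromosomes labelled by haplotype, for illustration only.
    transmitted = ["H1"] * 30 + ["H2"] * 70
    untransmitted = ["H1"] * 10 + ["H2"] * 90
    table = hhrr_table(transmitted, untransmitted, "H1")
    print(table, chi_square_2x2(*table))
```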

Statistical significance based on uncorrelated pattern formation (Califano et al., Analysis of gene expression microarrays for phenotype classification, Proc. Int. Conf. Intell. Syst. Mol. Biol., 2000; 8:75-85).

In another aspect, it will be understood that the invention provides systems that may be employed to compare the orthologous sequences. The systems may be machines as well as software tools and can include devices for processing sequence data as well as data visualization tools which can highlight patterns in data that is visually displayed. The system may comprise a conventional data processing platform such as an IBM PC-compatible computer running the Windows operating systems, or a SUN workstation running a Unix operating system. Alternatively, the system can comprise a dedicated processing system that includes an embedded programmable data processing system. For example, the system can comprise a single board computer system that has been integrated into a system for sequencing genomic data, identifying SNPs or markers, collecting expression data, or for performing other laboratory processes. The system may also be able to classify the sequence data into one or more of coding, non-coding, functional and non-functional sequences.

As used herein, the term “genome” is intended to mean the full complement of chromosomal DNA found within the nucleus of a eukaryotic cell. The term can also be used to refer to the entire genetic complement of a prokaryote, virus, mitochondrion or chloroplast or to the haploid nuclear genetic complement of a eukaryotic species.

As used herein, the term “genomic DNA” or “gDNA” is intended to mean one or more chromosomal polymeric deoxyribonucleotide molecules occurring naturally in the nucleus of a eukaryotic cell or in a prokaryote, virus, mitochondrion or chloroplast and containing sequences that are naturally transcribed into RNA as well as sequences that are not naturally transcribed into RNA by the cell. A gDNA of a eukaryotic cell contains at least one centromere, two telomeres, one origin of replication, and one sequence that is not transcribed into RNA by the eukaryotic cell including, for example, an intron or transcription promoter. A gDNA of a prokaryotic cell contains at least one origin of replication and one sequence that is not transcribed into RNA by the prokaryotic cell including, for example, a transcription promoter. A eukaryotic genomic DNA can be distinguished from prokaryotic, viral or organellar genomic DNA, for example, according to the presence of introns in eukaryotic genomic DNA and absence of introns in the gDNA of the others.

As used herein, the term “detecting” is intended to mean any method of determining the presence of a particular molecule such as a nucleic acid having a specific nucleotide sequence. Techniques used to detect a nucleic acid include, for example, hybridization to the sequence to be detected. However, particular embodiments of this invention need not require hybridization directly to the sequence to be detected, but rather the hybridization can occur near the sequence to be detected, or adjacent to the sequence to be detected. Use of the term “near” is meant to imply within about 150 bases from the sequence to be detected. Other distances along a nucleic acid that are within about 150 bases and therefore near include, for example, about 100, 50, 40, 30, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 bases from the sequence to be detected. Hybridization can occur at sequences that are further distances from a locus or sequence to be detected including, for example, a distance of about 250 bases, 500 bases, 1 kilobase or more up to and including the length of the target nucleic acids or genome fragments being detected.

Examples of reagents which are useful for detection include, but are not limited to, radiolabeled probes, fluorophore-labeled probes, quantum dot-labeled probes, chromophore-labeled probes, enzyme-labeled probes, affinity ligand-labeled probes, electromagnetic spin labeled probes, heavy atom labeled probes, probes labeled with nanoparticle light scattering labels or other nanoparticles or spherical shells, and probes labeled with any other signal generating label known to those of skill in the art. Non-limiting examples of label moieties useful for detection in the invention include suitable enzymes such as horseradish peroxidase, alkaline phosphatase, β-galactosidase, or acetylcholinesterase; members of a binding pair that are capable of forming complexes such as streptavidin/biotin, avidin/biotin or an antigen/antibody complex including, for example, rabbit IgG and anti-rabbit IgG; fluorophores such as umbelliferone, fluorescein, fluorescein isothiocyanate, rhodamine, tetramethyl rhodamine, eosin, green fluorescent protein, erythrosin, coumarin, methyl coumarin, pyrene, malachite green, stilbene, lucifer yellow, Cascade Blue™, Texas Red, dichlorotriazinylamine fluorescein, dansyl chloride, phycoerythrin, fluorescent lanthanide complexes such as those including Europium and Terbium, Cy3, Cy5, molecular beacons and fluorescent derivatives thereof, as well as others known in the art as described, for example, in Principles of Fluorescence Spectroscopy, Joseph R. Lakowicz (Editor), Plenum Pub Corp, 2nd edition (July 1999) and the 6th Edition of the Molecular Probes Handbook by Richard P. Haugland; a luminescent material such as luminol; light scattering or plasmon resonant materials such as gold or silver particles or quantum dots; or radioactive material including ¹⁴C, ¹²³I, ¹²⁴I, ¹²⁵I, ¹³¹I, ⁹⁹ᵐTc, ³⁵S or ³H.

Mutation is meant to encompass single nucleotide polymorphisms (SNPs), mutations, variable number of tandem repeats (VNTRs) and short tandem repeats (STRs), other polymorphisms, insertions, deletions, splice variants or any other known genetic markers. Exemplary resources that provide known SNPs and other genetic variations include, but are not limited to, the dbSNP administered by the NCBI and available online at ncbi.nlm.nih.gov/SNP/ and the HGVBASE database described in Fredman et al. Nucleic Acids Research, 30:387-91, (2002) and available online at hgvbase.cgb.ki.se/.

As used herein, the term “corresponding to,” when used in reference to a locus, is intended to mean having a nucleotide sequence that is identical or complementary to the sequence of the locus, or a diagnostic portion thereof. Exemplary diagnostic portions include, for example, nucleic acid sequences adjacent or near to the locus of interest.

As used herein, the term “multiplex” is intended to mean simultaneously conducting a plurality of assays on one or more samples. Multiplexing can further include simultaneously conducting a plurality of assays in each of a plurality of separate samples. For example, the number of reaction mixtures analyzed can be based on the number of wells in a multi-well plate (or holes in a through-hole array) and the number of assays conducted in each well can be based on the number of probes that contact the contents of each well. Thus, 96 well, 384 well or 1536 well microtiter plates will utilize composite arrays comprising 96, 384 and 1536 individual arrays, although as will be appreciated by those in the art, not each microtiter well need contain an individual array. Depending on the size of the microtiter plate and the size of the individual array, very high numbers of assays can be run simultaneously; for example, using individual arrays of 2,000 and a 96 well microtiter plate, 192,000 experiments can be done at once; the same arrays in a 384 well microtiter plate yield 768,000 simultaneous experiments, and a 1536 well microtiter plate gives 3,072,000 experiments. Although multiplexing has been exemplified with respect to microtiter plates, it will be understood that other formats can be used for multiplexing including, for example, those described in U.S. 2002/0102578 A1.
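The experiment counts quoted above are simply the number of wells multiplied by the number of probes per individual array. A minimal sketch of that arithmetic, with a hypothetical function name and assuming every well holds one full array, is:

```python
def simultaneous_assays(wells: int, probes_per_array: int) -> int:
    """Number of assays run at once when every well holds one array."""
    if wells < 0 or probes_per_array < 0:
        raise ValueError("counts must be non-negative")
    return wells * probes_per_array


if __name__ == "__main__":
    for wells in (96, 384, 1536):
        print(wells, simultaneous_assays(wells, 2_000))
```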

Predictive Medicine

The present invention is based, at least in part, on the identification of alleles that are associated (to a statistically significant extent) with the development of Hirschsprung disease in subjects. Therefore, detection of these alleles, alone or in conjunction with another means, in a subject indicates that the subject has or is predisposed to the development of Hirschsprung disease. Examples include polymorphic alleles which are associated with a propensity for developing Hirschsprung disease as described herein, or an allele that is in linkage disequilibrium with one of the aforementioned alleles. In a preferred embodiment, this allelic pattern permits the diagnosis of Hirschsprung disease.

Detection of the RET+3 allelic variant in an individual suggests an increased likelihood of developing Hirschsprung disease in comparison to a control individual who does not carry the allelic variant. However, because these alleles are in linkage disequilibrium with other alleles, the detection of such other linked alleles can also indicate that the subject has or is predisposed to the development of Hirschsprung disease. These alleles may be identified by methods known in the art.

One of skill in the art can readily identify other alleles (including polymorphisms and mutations) that are in linkage disequilibrium with an allele associated with a disease. For example, a nucleic acid sample from a first group of subjects without the disease can be collected, as well as DNA from a second group of subjects with the disease. The nucleic acid samples can then be compared to identify those alleles that are over-represented in the second group as compared with the first group, wherein such alleles are presumably associated with the disease. Alternatively, alleles that are in linkage disequilibrium with the disease associated allele can be identified, for example, by genotyping a large population and performing statistical analysis to determine which alleles appear more commonly together than expected. Preferably the group is chosen to consist of genetically related individuals. Genetically related individuals include individuals from the same race, the same ethnic group, or even the same family. As the degree of genetic relatedness between a control group and a test group increases, so does the predictive value of polymorphic alleles which are ever more distantly linked to a disease-causing allele. This is because less evolutionary time has passed to allow polymorphisms which are linked along a chromosome in a founder population to redistribute through genetic cross-over events. Thus, race-specific, ethnic-specific, and even family-specific diagnostic genotyping assays can be developed to allow for the detection of disease alleles which arose at ever more recent times in human evolution, e.g., after divergence of the major human races, after the separation of human populations into distinct ethnic groups, and even within the recent history of a particular family line.

Linkage disequilibrium between two polymorphic markers or between one polymorphic marker and a disease-causing mutation is a meta-stable state. Absent selective pressure or the sporadic linked reoccurrence of the underlying mutational events, the polymorphisms will eventually become disassociated by chromosomal recombination events and will thereby reach linkage equilibrium through the course of human evolution. Thus, the likelihood of finding a polymorphic allele in linkage disequilibrium with a disease or condition may increase with changes in at least two factors: decreasing physical distance between the polymorphic marker and the disease-causing mutation, and decreasing number of meiotic generations available for the dissociation of the linked pair. Consideration of the latter factor suggests that, the more closely related two individuals are, the more likely they will share a common parental chromosome or chromosomal region containing the linked polymorphisms and the less likely that this linked pair will have become unlinked through meiotic cross-over events occurring each generation. As a result, the more closely related two individuals are, the more likely it is that widely spaced polymorphisms may be co-inherited. Thus, for individuals related by common race, ethnicity or family, ever more distantly spaced polymorphic loci can be relied upon as an indicator of inheritance of a linked disease-causing mutation.
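The two factors named above can be illustrated with the standard textbook decay model D_t = D_0(1 − r)^t, where r is the recombination fraction between the two loci (smaller for physically closer loci) and t is the number of meiotic generations. This model is an assumption of the sketch and is not taken from this disclosure. It ignores selection, drift and recurrent mutation, and the function name and values are hypothetical.

```python
def ld_after_generations(d0: float,
                         recombination_fraction: float,
                         generations: int) -> float:
    """Expected linkage disequilibrium coefficient D after t generations.

    Uses D_t = D_0 * (1 - r) ** t, a textbook model assuming random mating
    and no selection, drift, or recurrent mutation.
    """
    if not 0.0 <= recombination_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - recombination_fraction) ** generations


if __name__ == "__main__":
    # Hypothetical values: tighter linkage and fewer generations both
    # leave more disequilibrium in place.
    for r in (0.001, 0.01, 0.1):
        for t in (10, 100):
            print(r, t, round(ld_after_generations(0.25, r, t), 5))
```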

Appropriate probes may be designed to hybridize to specific genes identified by methods described herein. For example, the human genome database collects intragenic SNPs, is searchable by sequence and currently contains approximately 2,700 entries (http://hgbase.interactiva.de). Also available is a human polymorphism database maintained by the Massachusetts Institute of Technology (MIT SNP database (http://www.genome.wi.mit.edu/SNP/human/index.html)). From such sources SNPs as well as other human polymorphisms may be found.

Detection of Alleles

Many methods are available for detecting mutations. The preferred method for detecting a mutation will depend, in part, upon the molecular nature of the mutation. For example, the various allelic forms of the mutation may differ by a single base-pair of the DNA. Such single nucleotide polymorphisms (or SNPs) are major contributors to genetic variation, comprising some 80% of all known polymorphisms, and their density in the human genome is estimated to be on average 1 per 1,000 base pairs. SNPs are most frequently biallelic, occurring in only two different forms (although up to four different forms of an SNP, corresponding to the four different nucleotide bases occurring in DNA, are theoretically possible). Nevertheless, SNPs are mutationally more stable than other polymorphisms, making them suitable for association studies in which linkage disequilibrium between markers and an unknown variant is used to map disease-causing mutations. In addition, because SNPs typically have only two alleles, they can be genotyped by a simple plus/minus assay rather than a length measurement, making them more amenable to automation.

A variety of methods are available for detecting the presence of a particular single nucleotide polymorphic allele in an individual. Advancements in this field have provided accurate, easy, and inexpensive large-scale SNP genotyping. For example, several such methods include dynamic allele-specific hybridization (DASH), microplate array diagonal gel electrophoresis (MADGE), pyrosequencing, oligonucleotide-specific ligation, the TaqMan system as well as various DNA “chip” technologies such as the Affymetrix SNP chips.

Several methods have been developed to facilitate analysis of single nucleotide polymorphisms. In one embodiment, the single base polymorphism can be detected by using a specialized exonuclease-resistant nucleotide, as disclosed, e.g., in Mundy, C. R. (U.S. Pat. No. 4,656,127).

In another embodiment of the invention, a solution-based method is used for determining the identity of the nucleotide of a polymorphic site, e.g., a mutation (Cohen, D. et al., French Patent 2,650,840; PCT Appln. No. WO91/02087). As in the Mundy method of U.S. Pat. No. 4,656,127, a primer is employed that is complementary to allelic sequences immediately 3′ to a polymorphic site. The method determines the identity of the nucleotide of that site using labeled dideoxynucleotide derivatives, which, if complementary to the nucleotide of the polymorphic site, will become incorporated onto the terminus of the primer. An alternative method, known as Genetic Bit Analysis or GBA™, is described by Goelet, P. et al. (PCT Appln. No. 92/15712). Several primer-guided nucleotide incorporation procedures for assaying polymorphic sites in DNA have been described (Kornher, J. S. et al., Nucl. Acids Res. 17:7779-7784 (1989); Sokolov, B. P., Nucl. Acids Res. 18:3671 (1990); Syvanen, A.-C., et al., Genomics 8:684-692 (1990); Kuppuswamy, M. N. et al., Proc. Natl. Acad. Sci. (U.S.A.) 88:1143-1147 (1991); Prezant, T. R. et al., Hum. Mutat. 1:159-164 (1992); Ugozzoli, L. et al., GATA 9:107-112 (1992); Nyren, P. et al., Anal. Biochem. 208:171-175 (1993)).

For mutations that produce premature termination of protein translation, the protein truncation test (PTT) offers an efficient diagnostic approach (Roest et al., (1993) Hum. Mol. Genet. 2:1719-21; van der Luijt et al., (1994) Genomics 20:1-4). For PTT, RNA is initially isolated from available tissue and reverse-transcribed, and the segment of interest is amplified by PCR. The products of reverse transcription PCR are then used as a template for nested PCR amplification with a primer that contains an RNA polymerase promoter and a sequence for initiating eukaryotic translation. After amplification of the region of interest, the unique motifs incorporated into the primer permit sequential in vitro transcription and translation of the PCR products. Upon sodium dodecyl sulfate-polyacrylamide gel electrophoresis of translation products, the appearance of truncated polypeptides signals the presence of a mutation that causes premature termination of translation. In a variation of this technique, DNA (as opposed to RNA) is used as a PCR template when the target region of interest is derived from a single exon.

Any cell type or tissue may be utilized to obtain nucleic acid samples for use in the diagnostics described herein. In a preferred embodiment, the DNA sample is obtained from a bodily fluid, e.g., blood, obtained by known techniques (e.g. venipuncture) or saliva. Alternatively, nucleic acid tests can be performed on dry samples (e.g. hair or skin). When using RNA or protein, the cells or tissues that may be utilized must express the gene of interest.

Diagnostic procedures may also be performed in situ directly upon tissue sections (fixed and/or frozen) of patient tissue obtained from biopsies or resections, such that no nucleic acid purification is necessary. Nucleic acid reagents may be used as probes and/or primers for such in situ procedures (see, for example, Nuovo, G. J., 1992, PCR in situ hybridization: protocols and applications, Raven Press, NY).

In addition to methods which focus primarily on the detection of one nucleic acid sequence, profiles may also be assessed in such detection schemes. Fingerprint profiles may be generated, for example, by utilizing a differential display procedure, Northern analysis and/or RT-PCR.

A preferred detection method is allele specific hybridization using probes overlapping a region of at least one allele and having about 5, 10, 20, 25, or 30 nucleotides around the mutation or polymorphic region. In a preferred embodiment of the invention, several probes capable of hybridizing specifically to other allelic variants involved in Hirschsprung disease are attached to a solid phase support, e.g., a “chip” (which can hold up to about 250,000 oligonucleotides). Oligonucleotides can be bound to a solid support by a variety of processes, including lithography. Mutation detection analysis using these chips comprising oligonucleotides, also termed “DNA probe arrays,” is described e.g., in Cronin et al. (1996) Human Mutation 7:244. In one embodiment, a chip comprises all the allelic variants of at least one polymorphic region of a gene. The solid phase support is then contacted with a test nucleic acid and hybridization to the specific probes is detected. Accordingly, the identity of numerous allelic variants of one or more genes can be identified in a simple hybridization experiment.

These techniques may also comprise the step of amplifying the nucleic acid before analysis. Amplification techniques are known to those of skill in the art and include, but are not limited to, cloning, polymerase chain reaction (PCR), polymerase chain reaction of specific alleles (ASA), ligase chain reaction (LCR), nested polymerase chain reaction, self sustained sequence replication (Guatelli, J. C. et al., 1990, Proc. Natl. Acad. Sci. USA 87:1874-1878), transcriptional amplification system (Kwoh, D. Y. et al., 1989, Proc. Natl. Acad. Sci. USA 86:1173-1177), and Q-Beta Replicase (Lizardi, P. M. et al., 1988, Bio/Technology 6:1197).

Amplification products may be assayed in a variety of ways, including size analysis, restriction digestion followed by size analysis, detecting specific tagged oligonucleotide primers in the reaction products, allele-specific oligonucleotide (ASO) hybridization, allele specific 5′ exonuclease detection, sequencing, hybridization, and the like.

PCR based detection means can include multiplex amplification of a plurality of markers simultaneously. For example, it is well known in the art to select PCR primers to generate PCR products that do not overlap in size and can be analyzed simultaneously. Alternatively, it is possible to amplify different markers with primers that are differentially labeled and thus can each be differentially detected. Of course, hybridization based detection means allow the differential detection of multiple PCR products in a sample. Other techniques are known in the art to allow multiplex analyses of a plurality of markers.

In yet another embodiment, any of a variety of sequencing reactions known in the art can be used to directly sequence the allele. Exemplary sequencing reactions include those based on techniques developed by Maxam and Gilbert ((1977) Proc. Natl. Acad. Sci. USA 74:560) or Sanger (Sanger et al. (1977) Proc. Natl. Acad. Sci. USA 74:5463). It is also contemplated that any of a variety of automated sequencing procedures may be utilized when performing the subject assays (see, for example, Biotechniques (1995) 19:448), including sequencing by mass spectrometry (see, for example, PCT publication WO 94/16101; Cohen et al. (1996) Adv Chromatogr 36:127-162; and Griffin et al. (1993) Appl Biochem Biotechnol 38:147-159). It will be evident to one of skill in the art that, for certain embodiments, the occurrence of only one, two or three of the nucleic acid bases need be determined in the sequencing reaction. For instance, A-track or the like, e.g., where only one nucleic acid is detected, can be carried out. Single molecule sequencing methods may also be used.

In a further embodiment, protection from cleavage agents (such as a nuclease, hydroxylamine or osmium tetroxide and with piperidine) can be used to detect mismatched bases in RNA/RNA or RNA/DNA or DNA/DNA heteroduplexes (Myers, et al. (1985) Science 230:1242). In general, the art technique of “mismatch cleavage” starts by providing heteroduplexes formed by hybridizing (labeled) RNA or DNA containing the wild-type allele with the sample. The double-stranded duplexes are treated with an agent which cleaves single-stranded regions of the duplex, such as those which will exist due to base pair mismatches between the control and sample strands. For instance, RNA/DNA duplexes can be treated with RNase and DNA/DNA hybrids treated with S1 nuclease to enzymatically digest the mismatched regions. In other embodiments, either DNA/DNA or RNA/DNA duplexes can be treated with hydroxylamine or osmium tetroxide and with piperidine in order to digest mismatched regions. After digestion of the mismatched regions, the resulting material is then separated by size on denaturing polyacrylamide gels to determine the site of mutation. See, for example, Cotton et al. (1988) Proc. Natl. Acad. Sci. USA 85:4397; and Saleeba et al. (1992) Methods Enzymol. 217:286-295. In a preferred embodiment, the control DNA or RNA can be labeled for detection.

In still another embodiment, the mismatch cleavage reaction employs one or more proteins that recognize mismatched base pairs in double-stranded DNA (so called “DNA mismatch repair” enzymes). For example, the mutY enzyme of E. coli cleaves A at G/A mismatches and the thymine DNA glycosylase from HeLa cells cleaves T at G/T mismatches (Hsu et al. (1994) Carcinogenesis 15:1657-1662). According to an exemplary embodiment, a probe based on an allele of a locus haplotype is hybridized to a cDNA or other DNA product from a test cell(s). The duplex is treated with a DNA mismatch repair enzyme, and the cleavage products, if any, can be detected from electrophoresis protocols or the like. See, for example, U.S. Pat. No. 5,459,039.

In other embodiments, alterations in electrophoretic mobility will be used to identify a locus allele. For example, single strand conformation polymorphism (SSCP) may be used to detect differences in electrophoretic mobility between mutant and wild type nucleic acids (Orita et al. (1989) Proc. Natl. Acad. Sci. USA 86:2766, see also Cotton (1993) Mutat Res 285:125-144; and Hayashi (1992) Genet Anal Tech Appl 9:73-79). Single-stranded DNA fragments of sample and control locus alleles are denatured and allowed to renature. The secondary structure of single-stranded nucleic acids varies according to sequence; the resulting alteration in electrophoretic mobility enables the detection of even a single base change. The DNA fragments may be labeled or detected with labeled probes. The sensitivity of the assay may be enhanced by using RNA (rather than DNA), in which the secondary structure is more sensitive to a change in sequence. In a preferred embodiment, the subject method utilizes heteroduplex analysis to separate double stranded heteroduplex molecules on the basis of changes in electrophoretic mobility (Keen et al. (1991) Trends Genet 7:5).

In yet another embodiment, the movement of alleles in polyacrylamide gels containing a gradient of denaturant is assayed using denaturing gradient gel electrophoresis (DGGE) (Myers et al. (1985) Nature 313:495). When DGGE is used as the method of analysis, DNA will be modified to ensure that it does not completely denature, for example by adding a GC clamp of approximately 40 bp of high-melting GC-rich DNA by PCR. In a further embodiment, a temperature gradient is used in place of a denaturing agent gradient to identify differences in the mobility of control and sample DNA (Rosenbaum and Reissner (1987) Biophys Chem 265:12753).

Examples of other techniques for detecting alleles include, but are not limited to, selective oligonucleotide hybridization, selective amplification, or selective primer extension. For example, oligonucleotide primers may be prepared in which the known mutation or nucleotide difference (e.g., in allelic variants) is placed centrally and then hybridized to target DNA under conditions which permit hybridization only if a perfect match is found (Saiki et al. (1986) Nature 324:163; Saiki et al. (1989) Proc. Natl. Acad. Sci USA 86:6230). Such allele specific oligonucleotide hybridization techniques may be used to test one mutation or polymorphic region per reaction when oligonucleotides are hybridized to PCR amplified target DNA, or a number of different mutations or polymorphic regions when the oligonucleotides are attached to the hybridizing membrane and hybridized with labeled target DNA.

Alternatively, allele specific amplification technology which depends on selective PCR amplification may be used in conjunction with the instant invention. Oligonucleotides used as primers for specific amplification may carry the mutation or polymorphic region of interest in the center of the molecule (so that amplification depends on differential hybridization) (Gibbs et al. (1989) Nucleic Acids Res. 17:2437-2448) or at the extreme 3′ end of one primer where, under appropriate conditions, mismatch can prevent or reduce polymerase extension (Prossner (1993) Tibtech 11:238). In addition, it may be desirable to introduce a novel restriction site in the region of the mutation to create cleavage-based detection (Gasparini et al. (1992) Mol. Cell Probes 6:1). It is anticipated that in certain embodiments amplification may also be performed using Taq ligase (Barany (1991) Proc. Natl. Acad. Sci USA 88:189). In such cases, ligation will occur only if there is a perfect match at the 3′ end of the 5′ sequence, making it possible to detect the presence of a known mutation at a specific site by looking for the presence or absence of amplification.
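
By way of illustration only, the 3′-terminal mismatch principle described above can be sketched in Python. The function name, the toy sequences, and the all-or-nothing rule are assumptions made for this sketch; in practice a 3′ mismatch may merely reduce extension, and the primer is written here in the same sense as the strand it is compared against for simplicity.

```python
def allele_specific_extension(primer: str, target_sense: str, start: int) -> bool:
    """Hypothetical model: predict extension only when the primer's
    3'-terminal base equals the base at the corresponding position of the
    same-sense target strand. Real polymerases are less strict."""
    site = target_sense[start:start + len(primer)]
    if len(site) != len(primer):
        return False  # primer runs off the end of the target
    return site[-1].upper() == primer[-1].upper()


# Toy sequences (not from the specification); they differ at index 8.
wild_type = "ACGTTGCAAGGCTA"
mutant = "ACGTTGCATGGCTA"

primer_wt = "GTTGCAA"   # 3' end sits on the wild-type base
primer_mut = "GTTGCAT"  # 3' end sits on the mutant base

for name, target in (("wild type", wild_type), ("mutant", mutant)):
    print(name,
          "wt primer extends:", allele_specific_extension(primer_wt, target, 2),
          "mut primer extends:", allele_specific_extension(primer_mut, target, 2))
```

Running one reaction per allele-specific primer in this way distinguishes the two alleles by the presence or absence of a predicted product.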

In another embodiment, identification of the allelic variant is carried out using an oligonucleotide ligation assay (OLA), as described, e.g., in U.S. Pat. No. 4,998,617 and in Landegren, U. et al. ((1988) Science 241:1077-1080). The OLA protocol uses two oligonucleotides which are designed to be capable of hybridizing to abutting sequences of a single strand of a target. One of the oligonucleotides is linked to a separation marker, e.g., biotinylated, and the other is detectably labeled. If the precise complementary sequence is found in a target molecule, the oligonucleotides will hybridize such that their termini abut, and create a ligation substrate. Ligation then permits the labeled oligonucleotide to be recovered using avidin, or another biotin ligand. Nickerson, D. A. et al. have described a nucleic acid detection assay that combines attributes of PCR and OLA (Nickerson, D. A. et al. (1990) Proc. Natl. Acad. Sci. USA 87:8923-27). In this method, PCR is used to achieve the exponential amplification of target DNA, which is then detected using OLA.
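
The ligation condition described above can be illustrated with a minimal Python sketch. The function names and the toy target are hypothetical, and the model assumes that ligation requires both oligonucleotides to be perfectly complementary to adjacent stretches of the target; actual ligases discriminate most strongly at the junction.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def ola_ligates(target: str, oligo_a: str, oligo_b: str) -> bool:
    """Hypothetical model of OLA: both oligonucleotides are written 5'->3'.
    The joined product oligo_a + oligo_b can form only if its reverse
    complement occurs as one contiguous stretch of the target strand, i.e.
    both oligonucleotides hybridize perfectly and their termini abut."""
    return reverse_complement(oligo_a + oligo_b) in target


# Toy target strands (not from the specification); they differ at index 9,
# the base that pairs with the 3' end of oligo_a at the ligation junction.
target_ref = "GGATCCTACGTAAGCTT"
target_var = "GGATCCTACATAAGCTT"

oligo_a = "CTTAC"  # would carry the separation marker, e.g. biotin
oligo_b = "GTAGG"  # would carry the detectable label

print("reference target ligates:", ola_ligates(target_ref, oligo_a, oligo_b))
print("variant target ligates:", ola_ligates(target_var, oligo_a, oligo_b))
```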

Several techniques based on this OLA method have been developed and can be used to detect alleles of a locus haplotype. For example, U.S. Pat. No. 5,593,826 discloses an OLA using an oligonucleotide having a 3′-amino group and a 5′-phosphorylated oligonucleotide to form a conjugate having a phosphoramidate linkage. In another variation of OLA described in Tobe et al. ((1996) Nucleic Acids Res 24:3728), OLA combined with PCR permits typing of two alleles in a single microtiter well. By marking each of the allele-specific primers with a unique hapten, e.g., digoxigenin and fluorescein, each OLA reaction can be detected by using hapten specific antibodies that are labeled with different enzyme reporters, alkaline phosphatase or horseradish peroxidase. This system permits the detection of the two alleles using a high throughput format that leads to the production of two different colors.

Another embodiment of the invention is directed to kits for detecting a predisposition for developing Hirschsprung disease. Such a kit may contain one or more oligonucleotides, including 5′ and 3′ oligonucleotides that hybridize 5′ and 3′ to at least one allele of a locus haplotype. PCR amplification oligonucleotides should hybridize between 25 and 2500 base pairs apart, preferably between about 100 and about 500 bases apart, in order to produce a PCR product of convenient size for subsequent analysis. Kits may also include sequencing reagents and other reagents necessary for the methods described herein.
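
As a rough illustration of the spacing criteria stated above, a candidate primer pair could be screened as follows. This is only a sketch: the function name and the coordinate convention (spacing taken as the difference between two hypothetical hybridization positions) are assumptions, and "about 100 to about 500" is treated as a hard range for simplicity.

```python
def classify_primer_spacing(forward_pos: int, reverse_pos: int) -> str:
    """Classify the distance between two primer hybridization positions
    (hypothetical coordinates on the same reference sequence)."""
    spacing = abs(reverse_pos - forward_pos)
    if 100 <= spacing <= 500:
        return "preferred"      # about 100 to about 500 bases apart
    if 25 <= spacing <= 2500:
        return "acceptable"     # within the broader 25 to 2500 bp range
    return "out of range"


# Illustrative positions only.
for pair in ((1000, 1300), (0, 2000), (0, 10)):
    print(pair, classify_primer_spacing(*pair))
```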

Exemplary primers for use in the diagnostic methods include RETX10F: 5′-TTCCCTGAGGAGGAGAAGTGC-3′ and RETX12R: 5′-CACTTTTCCAAATTCGCCTT-3′. Other exemplary primers may be found, for example, in Minerva M. Carrasquillo et al., “Genome-wide association study and mouse model identify interaction between RET and EDNRB pathways in Hirschsprung disease,” Nature Genetics, vol. 32 (2002); Stacey Bolk et al., “A human model for multigenic inheritance: Phenotypic expression in Hirschsprung disease requires both the RET gene and a new 9q31 locus,” PNAS, vol. 97, pp. 268-273 (2000); and Stacey Bolk Gabriel et al., “Segregation at three loci explains familial and population risk in Hirschsprung disease,” Nature Genetics, vol. 31 (2002).

The design of additional oligonucleotides for use in the amplification and detection of polymorphic alleles by the method of the invention is facilitated by the availability of updated sequence information from human chromosomes. Suitable primers for the detection of a human polymorphism in these genes can be readily designed using sequence information and standard techniques known in the art for the design and optimization of primer sequences. Optimal design of such primer sequences can be achieved, for example, by the use of commercially available primer selection programs such as Primer 2.1, Primer 3 or GeneFisher (see also Nicklin M. H. J., Weith A., Duff G. W., “A Physical Map of the Region Encompassing the Human Interleukin-1α, Interleukin-1β, and Interleukin-1 Receptor Antagonist Genes” Genomics 19:382 (1995); Nothwang H. G., et al. “Molecular Cloning of the Interleukin-1 Gene Cluster: Construction of an Integrated YAC/PAC Contig and a Partial Transcriptional Map in the Region of Chromosome 2q13” Genomics 41:370 (1997); Clark, et al. (1986) Nucl. Acids. Res., 14:7897-7914 [published erratum appears in Nucleic Acids Res., 15:868 (1987)]; and the Genome Database (GDB) project at the URL http://www.gdb.org).
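
For a first-pass check of candidate primers before using a dedicated primer selection program, GC content and an approximate melting temperature can be computed as sketched below. The Wallace rule used here (2 °C per A/T, 4 °C per G/C) is a coarse textbook heuristic chosen for this illustration; it is not stated in this specification and is not the algorithm used by the programs named above.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees C by the Wallace rule.
    Assumption: a rough estimate only, most reasonable for short oligos."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# The two exemplary primer sequences given in the text above.
primers = {
    "RETX10F": "TTCCCTGAGGAGGAGAAGTGC",
    "RETX12R": "CACTTTTCCAAATTCGCCTT",
}

for name, seq in primers.items():
    print(name, len(seq), "nt",
          "GC fraction: %.2f" % gc_fraction(seq),
          "approx. Tm:", wallace_tm(seq))
```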

Therapeutics

Modulators of affected genes, or of a protein encoded by a gene that is in linkage disequilibrium with a gene carrying a mutation of the invention, can comprise any type of compound, including a protein, peptide, peptidomimetic, small molecule, or nucleic acid. Preferred agonists include nucleic acids, proteins or small molecules. Preferred antagonists, which can be identified, for example, using the assays described herein, include nucleic acids (e.g., single (antisense) or double stranded (triplex) DNA or PNA and ribozymes), proteins (e.g., antibodies) and small molecules that act to modulate, upregulate, suppress or inhibit transcription and/or protein activity.

Effective Dose

Toxicity and therapeutic efficacy of such compounds can be determined by standard pharmaceutical procedures in cell cultures or experimental animals, e.g., for determining the LD₅₀ (the dose lethal to 50% of the population) and the ED₅₀ (the dose therapeutically effective in 50% of the population). The dose ratio between toxic and therapeutic effects is the therapeutic index and it can be expressed as the ratio LD₅₀/ED₅₀. Compounds which exhibit large therapeutic indices are preferred. While compounds that exhibit toxic side effects may be used, care should be taken to design a delivery system that targets such compounds to the site of affected tissues in order to minimize potential damage to unaffected cells and, thereby, reduce side effects.
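
The ratio defined above is simple to compute; the following Python sketch shows it with purely hypothetical dose values (the numbers are invented for illustration and do not describe any compound).

```python
def therapeutic_index(ld50: float, ed50: float) -> float:
    """Therapeutic index expressed as LD50 / ED50.
    Both doses must be positive and in the same units (e.g. mg/kg)."""
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("LD50 and ED50 must be positive")
    return ld50 / ed50


# Hypothetical values for illustration only.
print(therapeutic_index(ld50=500.0, ed50=25.0))
```

A larger value indicates a wider margin between the therapeutically effective dose and the lethal dose.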

Data obtained from the cell culture assays and animal studies can be used in formulating a range of dosage for use in humans. The dosage of such compounds lies preferably within a range of circulating concentrations that include the ED₅₀ with little or no toxicity. The dosage may vary within this range depending upon the dosage form employed and the route of administration utilized. For any compound used in the method of the invention, the therapeutically effective dose can be estimated initially from cell culture assays. A dose may be formulated in animal models to achieve a circulating plasma concentration range that includes the IC₅₀ (i.e., the concentration of the test compound which achieves a half-maximal inhibition of symptoms) as determined in cell culture. Such information can be used to more accurately determine useful doses in humans. Levels in plasma may be measured, for example, by high performance liquid chromatography.

Formulation and Use

Compositions for use in accordance with the present invention may be formulated in a conventional manner using one or more physiologically acceptable carriers or excipients. Thus, the compounds and their physiologically acceptable salts and solvates may be formulated for administration by, for example, injection, inhalation or insufflation (either through the mouth or the nose) or oral, buccal, parenteral or rectal administration.

For such therapy, the compounds of the invention can be formulated for a variety of modes of administration, including systemic and topical or localized administration. Techniques and formulations generally may be found in Remington's Pharmaceutical Sciences, Mack Publishing Co., Easton, Pa. For systemic administration, injection is preferred, including intramuscular, intravenous, intraperitoneal, and subcutaneous. For injection, the compounds of the invention can be formulated in liquid solutions, preferably in physiologically compatible buffers such as Hank's solution or Ringer's solution. In addition, the compounds may be formulated in solid form and redissolved or suspended immediately prior to use. Lyophilized forms are also included.

For oral administration, the compositions may take the form of, for example, tablets or capsules prepared by conventional means with pharmaceutically acceptable excipients such as binding agents (e.g., pregelatinised maize starch, polyvinylpyrrolidone or hydroxypropyl methylcellulose); fillers (e.g., lactose, microcrystalline cellulose or calcium hydrogen phosphate); lubricants (e.g., magnesium stearate, talc or silica); disintegrants (e.g., potato starch or sodium starch glycolate); or wetting agents (e.g., sodium lauryl sulfate). The tablets may be coated by methods well known in the art. Liquid preparations for oral administration may take the form of, for example, solutions, syrups or suspensions, or they may be presented as a dry product for constitution with water or other suitable vehicle before use. Such liquid preparations may be prepared by conventional means with pharmaceutically acceptable additives such as suspending agents (e.g., sorbitol syrup, cellulose derivatives or hydrogenated edible fats); emulsifying agents (e.g., lecithin or acacia); non-aqueous vehicles (e.g., almond oil, oily esters, ethyl alcohol or fractionated vegetable oils); and preservatives (e.g., methyl or propyl-p-hydroxybenzoates or sorbic acid). The preparations may also contain buffer salts, flavoring, coloring and sweetening agents as appropriate.

Preparations for oral administration may be suitably formulated to give controlled release of the active compound. For buccal administration the compositions may take the form of tablets or lozenges formulated in conventional manner. For administration by inhalation, the compounds for use according to the present invention are conveniently delivered in the form of an aerosol spray presentation from pressurized packs or a nebuliser, with the use of a suitable propellant, e.g., dichlorodifluoromethane, trichlorofluoromethane, dichlorotetrafluoroethane, carbon dioxide or other suitable gas. In the case of a pressurized aerosol the dosage unit may be determined by providing a valve to deliver a metered amount. Capsules and cartridges of, e.g., gelatin for use in an inhaler or insufflator may be formulated containing a powder mix of the compound and a suitable powder base such as lactose or starch.

The compounds may be formulated for parenteral administration by injection, e.g., by bolus injection or continuous infusion. Formulations for injection may be presented in unit dosage form, e.g., in ampoules or in multi-dose containers, with an added preservative. The compositions may take such forms as suspensions, solutions or emulsions in oily or aqueous vehicles, and may contain formulating agents such as suspending, stabilizing and/or dispersing agents. Alternatively, the active ingredient may be in powder form for constitution with a suitable vehicle, e.g., sterile pyrogen-free water, before use.

The compounds may also be formulated in rectal compositions such as suppositories or retention enemas, e.g., containing conventional suppository bases such as cocoa butter or other glycerides.

In addition to the formulations described previously, the compounds may also be formulated as a depot preparation. Such long acting formulations may be administered by implantation (for example subcutaneously or intramuscularly) or by intramuscular injection. Thus, for example, the compounds may be formulated with suitable polymeric or hydrophobic materials (for example as an emulsion in an acceptable oil) or ion exchange resins, or as sparingly soluble derivatives, for example, as a sparingly soluble salt. Other suitable delivery systems include microspheres which offer the possibility of local noninvasive delivery of drugs over an extended period of time. This technology utilizes microspheres of precapillary size which can be injected via a coronary catheter into any selected part of, e.g., the heart or other organs without causing inflammation or ischemia. The administered therapeutic is slowly released from these microspheres and taken up by surrounding tissue cells (e.g., endothelial cells).

Systemic administration can also be transmucosal or transdermal. For transmucosal or transdermal administration, penetrants appropriate to the barrier to be permeated are used in the formulation. Such penetrants are generally known in the art, and include, for example, for transmucosal administration bile salts and fusidic acid derivatives. In addition, detergents may be used to facilitate permeation. Transmucosal administration may be through nasal sprays or using suppositories. For topical administration, the oligomers of the invention are formulated into ointments, salves, gels, or creams as generally known in the art. A wash solution can be used locally to treat an injury or inflammation to accelerate healing.

The compositions may, if desired, be presented in a pack or dispenser device which may contain one or more unit dosage forms containing the active ingredient. The pack may for example comprise metal or plastic foil, such as a blister pack. The pack or dispenser device may be accompanied by instructions for administration.

Assays to Identify Hirschsprung Disease Therapeutics

Based on the identification of mutations that cause or contribute to the development of Hirschsprung disease, the invention further features cell-based or cell free assays, e.g., for identifying Hirschsprung disease therapeutics. In one embodiment, a cell expressing a receptor, or a receptor for a protein that is encoded by a gene which is in linkage disequilibrium with a gene, on the outer surface of its cellular membrane is incubated in the presence of a test compound alone or in the presence of a test compound and another protein, and the interaction between the test compound and the receptor or between the protein (preferably a tagged protein) and the receptor is detected, e.g., by using a microphysiometer (McConnell et al. (1992) Science 257:1906). An interaction between the receptor and either the test compound or the protein is detected by the microphysiometer as a change in the acidification of the medium. This assay system thus provides a means of identifying molecular antagonists which, for example, function by interfering with protein-receptor interactions, as well as molecular agonists which, for example, function by activating a receptor.

Cellular or cell-free assays can also be used to identify compounds which modulate expression of a gene or a gene in linkage disequilibrium therewith, modulate translation of an mRNA, or which modulate the stability of an mRNA or protein. Accordingly, in one embodiment, a cell which is capable of producing protein is incubated with a test compound and the amount of protein produced in the cell medium is measured and compared to that produced from a cell which has not been contacted with the test compound. The specificity of the compound vis-à-vis the protein can be confirmed by various control analyses, e.g., measuring the expression of one or more control genes. In particular, this assay can be used to determine the efficacy of antisense, ribozyme and triplex compounds.

Cell-free assays can also be used to identify compounds which are capable of interacting with a protein, to thereby modify the activity of the protein. Such a compound can, e.g., modify the structure of a protein, thereby affecting its ability to bind to a receptor. In a preferred embodiment, cell-free assays for identifying such compounds consist essentially of a reaction mixture containing a protein and a test compound or a library of test compounds in the presence or absence of a binding partner. A test compound can be, e.g., a derivative of a binding partner, e.g., a biologically inactive target peptide, or a small molecule.

Accordingly, one exemplary screening assay of the present invention includes the steps of contacting a protein or functional fragment thereof with a test compound or library of test compounds and detecting the formation of complexes. For detection purposes, the molecule can be labeled with a specific marker and the test compound or library of test compounds labeled with a different marker. Interaction of a test compound with a protein or fragment thereof can then be detected by determining the level of the two labels after an incubation step and a washing step. The presence of two labels after the washing step is indicative of an interaction.

An interaction between molecules can also be identified by using real-time BIA (Biomolecular Interaction Analysis, Pharmacia Biosensor AB) which detects surface plasmon resonance (SPR), an optical phenomenon. Detection depends on changes in the mass concentration of macromolecules at the biospecific interface, and does not require any labeling of interactants. In one embodiment, a library of test compounds can be immobilized on a sensor surface, e.g., which forms one wall of a micro-flow cell. A solution containing the protein or functional fragment thereof is then flowed continuously over the sensor surface. A change in the resonance angle, as shown on a signal recording, indicates that an interaction has occurred. This technique is further described, e.g., in BIAtechnology Handbook by Pharmacia.

Another exemplary screening assay of the present invention includes the steps of (a) forming a reaction mixture including: (i) a protein associated with a disease identified by a method described herein or other protein, (ii) an appropriate receptor, and (iii) a test compound; and (b) detecting interaction of the protein and receptor. A statistically significant change (potentiation or inhibition) in the interaction of the protein and receptor in the presence of the test compound, relative to the interaction in the absence of the test compound, indicates a potential agonist (potentiator) or antagonist (inhibitor). The components of this assay can be contacted simultaneously. Alternatively, a protein can first be contacted with a test compound for an appropriate amount of time, following which the receptor is added to the reaction mixture. The efficacy of the compound can be assessed by generating dose response curves from data obtained using various concentrations of the test compound. Moreover, a control assay can also be performed to provide a baseline for comparison.
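
One simple way to summarize such a dose response curve is to estimate the concentration giving half-maximal inhibition. The Python sketch below does this by linear interpolation on a log-concentration scale between the two measured points that bracket 50% of the baseline signal. The function name, the data layout, and the interpolation method are assumptions made for illustration; curve fitting (e.g., a four-parameter logistic model) would normally be preferred for real data.

```python
import math
from typing import Optional, Sequence


def estimate_ic50(concs: Sequence[float],
                  responses: Sequence[float]) -> Optional[float]:
    """Estimate IC50 from a dose response series.

    concs: test compound concentrations in ascending order (all > 0).
    responses: signal as a fraction of the no-compound baseline
               (1.0 = no inhibition), one value per concentration.
    Returns None if the series never crosses 0.5.
    """
    for i in range(len(concs) - 1):
        r_lo, r_hi = responses[i], responses[i + 1]
        if r_lo >= 0.5 >= r_hi and r_lo != r_hi:
            # Fraction of the way from the lower to the higher concentration
            frac = (r_lo - 0.5) / (r_lo - r_hi)
            log_lo, log_hi = math.log10(concs[i]), math.log10(concs[i + 1])
            return 10 ** (log_lo + frac * (log_hi - log_lo))
    return None


# Hypothetical measurements for illustration only.
concentrations = [1.0, 100.0]
fraction_of_baseline = [0.75, 0.25]
print(estimate_ic50(concentrations, fraction_of_baseline))
```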

Complex formation between a protein and receptor may be detected by a variety of techniques. Modulation of the formation of complexes can be quantitated using, for example, detectably labeled proteins such as radiolabeled, fluorescently labeled, or enzymatically labeled proteins or receptors, by immunoassay, or by chromatographic detection.

It may be desirable to immobilize either the protein or the receptor to facilitate separation of complexes from uncomplexed forms of one or both of the proteins, as well as to accommodate automation of the assay. Binding of protein and receptor can be accomplished in any vessel suitable for containing the reactants. Examples include microtitre plates, test tubes, and micro-centrifuge tubes. In one embodiment, a fusion protein can be provided which adds a domain that allows the protein to be bound to a matrix. For example, glutathione-S-transferase fusion proteins can be adsorbed onto glutathione sepharose beads (Sigma Chemical, St. Louis, Mo.) or glutathione derivatized microtitre plates, which are then combined with the receptor, e.g., an ³⁵S-labeled receptor, and the test compound, and the mixture incubated under conditions conducive to complex formation, e.g., at physiological conditions for salt and pH, though slightly more stringent conditions may be desired. Following incubation, the beads are washed to remove any unbound label, and the matrix immobilized and radiolabel determined directly (e.g., beads placed in scintillant), or in the supernatant after the complexes are subsequently dissociated. Alternatively, the complexes can be dissociated from the matrix, separated by SDS-PAGE, and the level of protein or receptor found in the bead fraction quantitated from the gel using standard electrophoretic techniques such as described in the appended examples. Other techniques for immobilizing proteins on matrices are also available for use in the subject assay. For instance, either protein or receptor can be immobilized utilizing conjugation of biotin and streptavidin.

Transgenic animals can also be made to identify agonists and antagonists or to confirm the safety and efficacy of a candidate therapeutic. Transgenic animals of the invention can include non-human animals containing a Hirschsprung disease causative mutation under the control of an appropriate endogenous promoter or under the control of a heterologous promoter.

The transgenic animals can also be animals containing a transgene, such as a reporter gene, under the control of an appropriate promoter or fragment thereof. These animals are useful, e.g., for identifying drugs that modulate production of a protein, such as by modulating gene expression. Methods for obtaining transgenic non-human animals are well known in the art. In preferred embodiments, the expression of the Hirschsprung disease causative mutation is restricted to specific subsets of cells, tissues or developmental stages utilizing, for example, cis-acting sequences that control expression in the desired pattern. In the present invention, such mosaic expression of a protein can be essential for many forms of lineage analysis and can additionally provide a means to assess the effects of, for example, expression level which might grossly alter development in small patches of tissue within an otherwise normal embryo. Toward this end, tissue-specific regulatory sequences and conditional regulatory sequences can be used to control expression of the mutation in certain spatial patterns. Moreover, temporal patterns of expression can be provided by, for example, conditional recombination systems or prokaryotic transcriptional regulatory sequences. Genetic techniques which allow the expression of a mutation to be regulated via site-specific genetic manipulation in vivo are known to those skilled in the art.

The transgenic animals of the present invention all include within a plurality of their cells a Hirschsprung disease causative mutation transgene of the present invention, which transgene alters the phenotype of the “host cell”. In an illustrative embodiment, either the cre/loxP recombinase system of bacteriophage P1 (Lakso et al. (1992) PNAS 89:6232-6236; Orban et al. (1992) PNAS 89:6861-6865) or the FLP recombinase system of Saccharomyces cerevisiae (O'Gorman et al. (1991) Science 251:1351-1355; PCT publication WO 92/15694) can be used to generate in vivo site-specific genetic recombination systems. Cre recombinase catalyzes the site-specific recombination of an intervening target sequence located between loxP sequences. loxP sequences are 34 base pair nucleotide repeat sequences to which the Cre recombinase binds and which are required for Cre recombinase mediated genetic recombination. The orientation of loxP sequences determines whether the intervening target sequence is excised or inverted when Cre recombinase is present (Abremski et al. (1984) J. Biol. Chem. 259:1509-1514): Cre catalyzes the excision of the target sequence when the loxP sequences are oriented as direct repeats and catalyzes inversion of the target sequence when the loxP sequences are oriented as inverted repeats.
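
The dependence of the recombination outcome on loxP orientation can be illustrated with a toy Python model. The token names, the list representation of a construct, and the example cassette (a hypothetical transcriptional STOP element flanked by loxP sites) are assumptions made for this sketch and are not taken from the specification.

```python
from typing import List

LOXP_FORWARD = "loxP>"
LOXP_REVERSE = "loxP<"


def cre_recombine(construct: List[str]) -> List[str]:
    """Toy model of Cre-mediated recombination on a construct that
    contains exactly two loxP sites.

    Direct repeats (same orientation): the intervening segment is excised
    and a single loxP site remains.
    Inverted repeats (opposite orientation): the intervening segment is
    inverted in place (modeled here by reversing the order of its tokens).
    """
    sites = [i for i, tok in enumerate(construct)
             if tok in (LOXP_FORWARD, LOXP_REVERSE)]
    if len(sites) != 2:
        raise ValueError("this sketch handles exactly two loxP sites")
    first, second = sites
    if construct[first] == construct[second]:
        return construct[:first + 1] + construct[second + 1:]
    inverted = construct[first + 1:second][::-1]
    return construct[:first + 1] + inverted + construct[second:]


# Direct repeats: the hypothetical STOP element is removed.
floxed_stop = ["promoter", LOXP_FORWARD, "STOP", LOXP_FORWARD, "transgene"]
print(cre_recombine(floxed_stop))

# Inverted repeats: the intervening segments are flipped.
invertible = ["promoter", LOXP_FORWARD, "A", "B", LOXP_REVERSE, "tail"]
print(cre_recombine(invertible))
```

In the direct-repeat example, the transgene ends up adjacent to the promoter only after recombination, which is one way expression could be made dependent on where and when the recombinase is present.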

Accordingly, genetic recombination of the target sequence is dependent on expression of the Cre recombinase. Expression of the recombinase can be regulated by promoter elements which are subject to regulatory control, e.g., tissue-specific, developmental stage-specific, inducible or repressible by externally added agents. This regulated control will result in genetic recombination of the target sequence only in cells where recombinase expression is mediated by the promoter element. Thus, the activation of expression of the causative mutation transgene can be regulated via control of recombinase expression.

Use of the cre/loxP recombinase system to regulate expression of a causative mutation transgene requires the construction of a transgenic animal containing transgenes encoding both the Cre recombinase and the subject protein. Animals containing both the Cre recombinase and the Hirschsprung disease causative mutation transgene can be provided through the construction of “double” transgenic animals. A convenient method for providing such animals is to mate two transgenic animals each containing a transgene.

Similar conditional transgenes can be provided using prokaryotic promoter sequences which require prokaryotic proteins to be simultaneously expressed in order to facilitate expression of the transgene. Exemplary promoters and the corresponding trans-activating prokaryotic proteins are given in U.S. Pat. No. 4,833,080.

Moreover, expression of the conditional transgenes can be induced by gene therapy-like methods wherein a gene encoding the transactivating protein, e.g., a recombinase or a prokaryotic protein, is delivered to the tissue and caused to be expressed, such as in a cell-type specific manner. By this method, the transgene could remain silent into adulthood until “turned on” by the introduction of the transactivator.

In an exemplary embodiment, the “transgenic non-human animals” of the invention are produced by introducing transgenes into the germline of the non-human animal. Embryonal target cells at various developmental stages can be used to introduce transgenes. Different methods are used depending on the stage of development of the embryonal target cell. The specific line(s) of any animal used to practice this invention are selected for general good health, good embryo yields, good pronuclear visibility in the embryo, and good reproductive fitness. In addition, the haplotype is a significant factor. For example, when transgenic mice are to be produced, strains such as C57BL/6 or FVB lines are often used (Jackson Laboratory, Bar Harbor, Me.). Preferred strains are those with H-2^(b), H-2^(d) or H-2^(q) haplotypes such as C57BL/6 or DBA/1. The line(s) used to practice this invention may themselves be transgenics, and/or may be knockouts (i.e., obtained from animals which have one or more genes partially or completely suppressed).

In one embodiment, the transgene construct is introduced into a single stage embryo. The zygote is the best target for microinjection. In the mouse, the male pronucleus reaches the size of approximately 20 micrometers in diameter, which allows reproducible injection of 1-2 pl of DNA solution. The use of zygotes as a target for gene transfer has a major advantage in that in most cases the injected DNA will be incorporated into the host genome before the first cleavage (Brinster et al. (1985) PNAS 82:4438-4442). As a consequence, all cells of the transgenic animal will carry the incorporated transgene. This will in general also be reflected in the efficient transmission of the transgene to offspring of the founder since 50% of the germ cells will harbor the transgene. Transgenic animals may be made by any known or future developed technique, which would be known to one of skill in the art.

Transgenic offspring of the surrogate host may be screened for the presence and/or expression of the transgene by any suitable method. Screening is often accomplished by Southern blot or Northern blot analysis, using a probe that is complementary to at least a portion of the transgene. Western blot analysis using an antibody against the protein encoded by the transgene may be employed as an alternative or additional method for screening for the presence of the transgene product. Typically, DNA is prepared from tail tissue and analyzed by Southern analysis or PCR for the transgene. Alternatively, the tissues or cells believed to express the transgene at the highest levels are tested for the presence and expression of the transgene using Southern analysis or PCR, although any tissues or cell types may be used for this analysis.

Alternative or additional methods for evaluating the presence of the transgene include, without limitation, suitable biochemical assays such as enzyme and/or immunological assays, histological stains for particular marker or enzyme activities, flow cytometric analysis, and the like. Analysis of the blood may also be useful to detect the presence of the transgene product in the blood, as well as to evaluate the effect of the transgene on the levels of various types of blood cells and other blood constituents.

Progeny of the transgenic animals may be obtained by mating the transgenic animal with a suitable partner, or by in vitro fertilization of eggs and/or sperm obtained from the transgenic animal. Where mating with a partner is to be performed, the partner may or may not be transgenic and/or a knockout; where it is transgenic, it may contain the same or a different transgene, or both. Alternatively, the partner may be a parental line. Where in vitro fertilization is used, the fertilized embryo may be implanted into a surrogate host or incubated in vitro, or both. Using either method, the progeny may be evaluated for the presence of the transgene using methods described above, or other appropriate methods.

The transgenic animals produced in accordance with the present invention will include exogenous genetic material. Further, in such embodiments the sequence will be attached to a transcriptional control element, e.g., a promoter, which preferably allows the expression of the transgene product in a specific type of cell.

Retroviral infection can also be used to introduce the transgene into a non-human animal. The developing non-human embryo can be cultured in vitro to the blastocyst stage. During this time, the blastomeres can be targets for retroviral infection (Jaenisch, R. (1976) PNAS 73:1260-1264). Efficient infection of the blastomeres is obtained by enzymatic treatment to remove the zona pellucida (Manipulating the Mouse Embryo, Hogan eds. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 1986)). The viral vector system used to introduce the transgene is typically a replication-defective retrovirus carrying the transgene (Jahner et al. (1985) PNAS 82:6927-6931; Van der Putten et al. (1985) PNAS 82:6148-6152). Transfection is easily and efficiently obtained by culturing the blastomeres on a monolayer of virus-producing cells (Van der Putten, supra; Stewart et al. (1987) EMBO J. 6:383-388). Alternatively, infection can be performed at a later stage. Virus or virus-producing cells can be injected into the blastocoele (Jahner et al. (1982) Nature 298:623-628). Most of the founders will be mosaic for the transgene since incorporation occurs only in a subset of the cells which formed the transgenic non-human animal. Further, the founder may contain various retroviral insertions of the transgene at different positions in the genome which generally will segregate in the offspring. In addition, it is also possible to introduce transgenes into the germ line by intrauterine retroviral infection of the midgestation embryo (Jahner et al. (1982) supra).

A third type of target cell for transgene introduction is the embryonal stem cell (ES). ES cells are obtained from pre-implantation embryos cultured in vitro and fused with embryos (Evans et al. (1981) Nature 292:154-156; Bradley et al. (1984) Nature 309:255-258; Gossler et al. (1986) PNAS 83:9065-9069; and Robertson et al. (1986) Nature 322:445-448). Transgenes can be efficiently introduced into the ES cells by DNA transfection or by retrovirus-mediated transduction. Such transformed ES cells can thereafter be combined with blastocysts from a non-human animal. The ES cells thereafter colonize the embryo and contribute to the germ line of the resulting chimeric animal. For review see Jaenisch, R. (1988) Science 240:1468-1474.

The present invention is further illustrated by the following examples which should not be construed as limiting in any way. The contents of all cited references (including literature references, issued patents, published patent applications as cited throughout this application) are hereby expressly incorporated by reference. The practice of the present invention will employ, unless otherwise indicated, conventional techniques that are within the skill of the art. Such techniques are explained fully in the literature. See, for example, Molecular Cloning: A Laboratory Manual (2nd ed., Sambrook, Fritsch and Maniatis, eds., Cold Spring Harbor Laboratory Press: 1989); DNA Cloning, Volumes I and II (D. N. Glover ed., 1985); Oligonucleotide Synthesis (M. J. Gait ed., 1984); U.S. Pat. No. 4,683,195; U.S. Pat. No. 4,683,202; and Nucleic Acid Hybridization (B. D. Hames & S. J. Higgins eds., 1984).

The processes and systems described above can be realized as a software component operating on a conventional data processing system such as a Unix workstation. In that embodiment, the process can be implemented as a C language computer program, or a computer program written in any high level language including C++, Fortran, Java or Basic. Additionally, in an embodiment where microcontrollers or DSPs are employed, the process can be realized as a computer program written in microcode or written in a high level language and compiled down to microcode that can be executed on the platform employed. The development of such systems is known to those of skill in the art, and such techniques are set forth in Digital Signal Processing Applications with the TMS320 Family, Volumes I, II, and III, Texas Instruments (1990). Additionally, general techniques for high level programming are known, and set forth in, for example, Stephen G. Kochan, Programming in C, Hayden Publishing (1993). It is noted that DSPs are particularly suited for implementing signal processing functions, including preprocessing functions such as image enhancement through adjustments in contrast, edge definition and brightness. Developing code for the DSP and microcontroller systems follows from principles well known in the art.

Those skilled in the art will know, or be able to ascertain using no more than routine experimentation, many equivalents to the embodiments and practices described herein. For example, the systems and methods described herein may be employed in other applications including financial applications, engineering applications and other applications that would benefit from having patterns found within a large dataset. Accordingly, it will be understood that the invention is not to be limited to the embodiments disclosed herein, but is to be understood from the following claims, which are to be interpreted as broadly as allowed under the law.

EXAMPLES

Family-Based Association Studies

Genome sequence data (http://genome.ucsc.edu; build 35) identify two additional genes in the 350-kb region surrounding RET. GALNACT-2, a chondroitin N-acetylgalactosaminyltransferase^(9,10), contains 8 exons spanning 46.8 kb and begins 9 kb from the last RET exon. Thirteen exons encode RASGEF1A, a predicted guanyl-nucleotide exchange factor which spans 72 kb and begins 65 kb 3′ to RET. To genetically refine the association within this locus, we initially genotyped 28 single nucleotide polymorphisms (SNPs) spanning 175 kb in 126 HSCR-affected individuals and their parents, ascertained from the general outbred population (Table 1). The genomic interval encompasses RET, GALNACT-2 and RASGEF1A.

TABLE 1 Analysis of disease associations

T = transmitted, U = untransmitted, τ = transmission frequency.

                                                          All affected individuals  Male offspring            Female offspring
Gene                        Marker     dbSNP ID   A1  A2  T    U    τ               T    U    τ               T    U    τ
5′ RET                      RET−6      rs3097565  G   T   45   43   0.51            27   32   0.46            18   11   0.62
                            RET−5      rs2742250  G   C   56   44   0.56            41   27   0.60            15   17   0.47
                            RET−4      rs3026707  A   G   45   33   0.58            32   19   0.63            13   14   0.48
                            RET−3      rs3026720  T   C   38   29   0.57            27   17   0.61            11   12   0.48
                            RET−2†     rs741763   G   C   69   26   0.73**          47   14   0.77**          22   12   0.65
                            RET−1†     rs2505997  C   T   57   19   0.75**          43   10   0.81**          14   9    0.61
RET int1                    RET+1†     rs2435365  T   C   76   29   0.72**          53   17   0.76**          23   12   0.66
                            RET+2†     rs2435364  A   G   73   27   0.73**          50   15   0.77**          23   12   0.66
                            1.1SfcI†   rs2435362  A   C   100  27   0.79***         72   14   0.84***         28   13   0.68*
                            RET+3†‡    rs2435357  T   C   101  25   0.80**          73   12   0.86***         28   13   0.68*
                            RET+4†     rs752975   A   G   74   29   0.72**          51   17   0.75**          23   12   0.66
                            INT1.4b†   rs2505535  G   A   92   28   0.77***         68   14   0.83***         24   14   0.63
RET protein-coding region   X2EagI†    rs1800658  A   G   96   28   0.77***         72   14   0.84***         24   14   0.63
                            INT8       rs3026750  G   A   60   40   0.60*           42   22   0.66*           18   18   0.50
                            X13TaqI    rs1803861  G   T   59   38   0.61*           41   21   0.66*           18   17   0.51
                            I18BbvI    rs2742237  C   G   59   28   0.68*           41   14   0.75**          18   14   0.56
                            I18StyI    rs2742239  A   G   52   27   0.68*           34   12   0.74**          18   15   0.55
                            I19BsgI    rs2075912  T   C   55   25   0.69*           37   12   0.76**          18   13   0.58
GALNACT-2                   GN−1       rs3026787  G   A   17   14   0.55            15   10   0.60            2    4    0.33
                            GN+1       rs4948705  C   T   59   29   0.67*           40   13   0.75**          19   16   0.54
                            GN+2       rs1864393  A   G   35   15   0.70*           27   9    0.75**          8    6    0.57
                            GN+3       rs2435337  G   C   57   29   0.66*           39   14   0.74**          18   15   0.55
                            GN+4       rs2505556  C   T   63   59   0.52            42   39   0.52            21   20   0.51
                            GN+5       rs2435384  G   T   57   39   0.59            39   21   0.65*           18   18   0.50
                            GN+6       rs2435381  T   C   55   41   0.57            37   28   0.57            18   13   0.58
RASGEF1A                    RAS+2      rs1254958  T   C   56   27   0.67*           38   13   0.75**          18   14   0.56
                            RAS+1      rs1254965  T   C   56   27   0.67*           38   12   0.76**          18   15   0.55
                            RAS−1      rs1272142  G   T   55   41   0.57            38   22   0.63*           17   19   0.47
                            RAS−2      rs1955356  A   T   51   39   0.57            33   24   0.58            18   15   0.55

Transmission Disequilibrium Tests (TDT)¹¹ on each SNP demonstrated statistically significant disease associations spanning a region immediately 5′ of RET through RASGEF1A (FIG. 1a; Table 1). Specifically, 13 of 17 RET SNPs, 3 of 7 GALNACT-2 SNPs and 2 of 4 RASGEF1A SNPs tested are significantly associated with HSCR (Table 1), reflecting the high background linkage disequilibrium (LD) in this region (data not shown). However, the greatest statistical significance and, more importantly, the largest transmission distortions (τ≥0.7) occurred among 8 SNPs in a 27.6-kb segment from 4.2 kb 5′ of RET through RET exon 2 (FIG. 1a). Within this region the highest association was within RET intron 1.

Three re-sequencing experiments were performed and analyzed to identify additional variants, with particular emphasis given to multi-species conserved sequences (MCS; see later) within the 27.6-kb region of highest association. Specifically, we identified the SNP RET+3 (marked by * in FIG. 1a) within MCS+9.7 by re-sequencing HSCR patients from families with demonstrated RET-linkage but no identified coding sequence mutations. TDT of RET+3 in all 126 trios demonstrated the largest transmission distortion (τ=0.8) and the highest statistical significance (p=10⁻¹¹). Interestingly, when association tests are factored by offspring gender, a known risk factor in HSCR, RET+3 and the adjacent marker 1.1SfcI (3.3 kb away) are the only two SNPs demonstrating association in females. Two additional variants (rs2506005 and rs2506004) lie within MCS+9.7, 76 nt 5′ and 217 nt 3′ of RET+3, respectively; both are in complete linkage disequilibrium with RET+3 and each other. The HSCR-associated allele at each of these additional SNPs is the ancestral allele. Interestingly, the RET+3:C allele is very highly conserved in all 9 mammalian species examined (FIG. 5) and it is the derived polymorphic allele (RET+3:T) that is overtransmitted. We postulate that RET+3 is the most likely site of the disease variation.

It was queried whether HSCR-susceptibility within this locus can be explained by RET alone or whether additional common variants might be present at GALNACT-2 or RASGEF1A. The Exhaustive Allelic TDT (EATDT), a novel method to iteratively and successively test all possible haplotypes of all possible sizes for association with HSCR^(12,13), was used. Seventeen haplotypes are significantly associated with HSCR but they have two critical properties (FIG. 1b): (1) no associated haplotype is limited to markers across GALNACT-2 or RASGEF1A; (2) all haplotypes involve RET SNPs alone, particularly those in intron 1. These results strongly suggest a role for a single, common variant within RET. Since all but one haplotype involves RET+3, it was concluded that the HSCR association arises from RET+3 (1) being in tight LD with a yet unknown disease-susceptibility variant, (2) being the disease-causing mutation alone, or (3) being a disease-causing variant that acts synergistically with additional disease variants on the associated haplotype.

Comparative Genomics to Define Functional Elements

The finding of association across an intron suggested the need to identify functional elements within the RET locus. Systematic comparisons of orthologous sequences can uncover coding and non-coding functional elements on the assumption that such regions evolve more slowly than non-functional (neutral) sequences^(14, 11, 10, 9). The genomic sequence of a ˜350-kb segment encompassing human RET was obtained and compared with the orthologous intervals in 12 non-human vertebrates. Multi-species conserved sequences (MCSs) were identified as the intersection of elements which satisfied the criteria of Bray¹⁵ and Margulies¹⁶. Synteny is preserved across this interval in all vertebrates examined, although the fraction of sequence that can be aligned with the human sequence decreases with increasing evolutionary distance (FIG. 2a).
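The intersection step can be illustrated with a short sketch. It assumes that each conservation method yields a sorted list of non-overlapping, half-open (start, end) intervals on a common coordinate system; the function name and the example coordinates are hypothetical and are not taken from the analyses described here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end)


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two current intervals overlap
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical conserved elements reported by two different methods.
method_a = [(100, 250), (400, 520)]
method_b = [(180, 300), (390, 410), (500, 600)]
print(intersect_intervals(method_a, method_b))
```

Only positions flagged by both methods survive, which makes the resulting set more conservative than either input.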

A total of 84 MCSs were identified (Table 3), with 44% (37/84) of the identified MCSs corresponding to exons of RET, GALNACT-2 and RASGEF1A. The remaining 47 MCSs are likely non-coding since no matching cDNA sequence or open reading frame greater than 20 amino acids in length was found. We identified 5 such elements within the most highly associated 27.6-kb region around RET intron 1 (MCS-5.2, MCS-1.3, MCS+2.8, MCS+5.1 and MCS+9.7, named for their distance in kb from the RET start site; FIG. 3a).

TABLE 3 Positions of all identified MCSs^(a)

Start      End        Length   Description           Exon #
42750079   42750298   219      Extragenic
42759068   42759363   295      Extragenic
42765824   42766058   234      Extragenic
42767294   42767649   355      Extragenic
42847632   42847887   255      Extragenic
42848019   42848428   409      Extragenic
42849042   42849161   119      Extragenic
42851086   42851421   335      Extragenic
42855277   42855460   183      Extragenic
42856618   42856867   249      RET coding            1
42859464   42859741   277      RET intron
42861719   42861898   179      RET intron
42866040   42866290   250      RET intron
42879799   42880213   414      RET coding            2
42881785   42882105   320      RET coding            3
42884371   42884649   278      RET coding            4
42885781   42885995   214      RET coding            5
42888446   42888719   273      RET coding            6
42890659   42890942   283      RET coding            7
42891570   42891681   111      RET coding            8
42892246   42892423   177      RET coding            9
42893017   42893163   146      RET coding            10
42893936   42894201   265      RET coding            11
42896029   42896206   177      RET coding            12
42897767   42897941   174      RET coding            13
42898979   42899202   223      RET coding            14
42899531   42899649   118      RET coding            15
42901360   42901477   117      RET coding            16
42903071   42903295   224      RET coding            17
42904333   42904442   109      RET coding            18
42904882   42905016   134      RET intron
42906007   42906455   448      RET coding            19
42906737   42907024   287      RET intron
42907345   42907818   473      RET coding            20
42908007   42908108   101      RET intron
42908320   42908431   111      RET intron
42908795   42908892   97       RET intron
42908915   42909011   96       RET coding
42909047   42909213   166      RET 3′ UTR            21
42909233   42909531   298      RET 3′ UTR            21
42909623   42909898   275      RET 3′ UTR            21
42920171   42920270   99       GALNACT-2 intron
42932033   42932131   98       GALNACT-2 intron
42933380   42933507   127      GALNACT-2 intron
42934314   42935286   972      GALNACT-2 coding      2
42935414   42935519   105      GALNACT-2 intron
42938093   42938420   327      GALNACT-2 coding      3
42938436   42938675   239      GALNACT-2 intron
42939392   42939528   136      GALNACT-2 intron
42939590   42939779   189      GALNACT-2 intron
42939849   42940073   224      GALNACT-2 coding      4
42941406   42941679   273      GALNACT-2 intron
42941954   42942112   158      GALNACT-2 intron
42942628   42942804   176      GALNACT-2 intron
42943088   42943227   139      GALNACT-2 intron
42943232   42943566   334      GALNACT-2 coding      5
42943578   42943752   174      GALNACT-2 intron
42946387   42946624   237      GALNACT-2 coding      6
42962682   42963286   604      GALNACT-2 coding      8
42963538   42963668   130      GALNACT-2 3′ UTR      8
42964122   42964240   118      GALNACT-2 3′ UTR      8
42964498   42964819   321      GALNACT-2 3′ UTR      8
42974002   42974194   192      RASGEF1A 3′ UTR       11
42974198   42974366   168      RASGEF1A 3′ UTR       11
42974969   42975687   718      RASGEF1A 3′ UTR       11
42975877   42976015   138      RASGEF1A coding       10
42976428   42976562   134      RASGEF1A coding       9
42977449   42977664   215      RASGEF1A coding       8
42978384   42978654   270      RASGEF1A coding       7
42979069   42979280   211      RASGEF1A coding       6
42979602   42979699   97       RASGEF1A coding       5
42980068   42980400   332      RASGEF1A coding       4
42981223   42981454   231      RASGEF1A coding       3
42982718   42982868   150      RASGEF1A coding       2
42985367   42985589   222      RASGEF1A coding       1b
42994619   42994928   309      RASGEF1A intron
42998282   42998387   105      RASGEF1A intron
42998429   42998591   162      RASGEF1A intron
42999227   42999321   94       RASGEF1A intron
43041678   43041839   161      RASGEF1A intron
43043917   43044130   213      RASGEF1A intron
43044601   43044684   83       RASGEF1A intron
43045824   43046021   197      RASGEF1A intron
43046209   43046493   284      RASGEF1A 5′ UTR       1a

^(a)Positions on human chromosome 10 are given relative to build 34 (July 2003) of the genome; see www.genome.ucsc.edu

Although GALNACT-2 and RASGEF1A are unlikely to harbor common HSCR variants, they might carry rare mutations and be important in HSCR, just as some of the 126 patients we studied also have rare RET mutations. To test their involvement in enteric development and HSCR, their temporal and spatial expression in humans and mice was characterized. Transcription of RASGEF1A is limited to brain and several tissues (bone marrow, testis, colon, and placenta) with high replicative capacity (FIG. 2b, c, d). RET and GALNACT-2 share overlapping, nearly ubiquitous postnatal expression patterns. Importantly, GALNACT-2 and RASGEF1A are both highly expressed at 13.5 dpc, coincident with peak RET expression and colonization of the gut by neural crest-derived neuronal precursors (FIG. 2c), a feature disrupted in HSCR. Consequently, GALNACT-2 and RASGEF1A expression patterns are consistent with a potential role in enteric neural crest migration. The analysis of morpholino-based gene knockdowns of the orthologous genes in zebrafish has, however, uncovered only mid-gastrulation defects in convergence and extension for Galnact-2 and central nervous system neuronal cell death by 24 hours post fertilization for Rasgef1a (data not shown). In contrast, similar disruption of RET results in incomplete colonization of the digestive tube by enteric neurons^(17,18). These functional analyses cannot exclude either GALNACT-2 or RASGEF1A as HSCR candidate genes as the observed embryonic lethality occurred prior to the onset of neural crest cell migration into the digestive tube. However, genetic association tests have excluded the occurrence of a common mutation at GALNACT-2 or RASGEF1A contributing to HSCR.

MCS+9.7 Functions as an Enhancer In Vitro

Although MCS+9.7 is likely a functional element, the specific function of this sequence and the mechanism by which it exhibits a deleterious effect are not known. MCS+9.7 demonstrates a minimum identity of 72.5% with all mammalian species examined. No predicted structural/regulatory RNAs were identified in MCS+9.7 using the QRNA algorithm¹⁹. The MCS+9.7 sequence includes a gamut of predicted transcription factor binding sites (Table 4), including two retinoic acid response elements (RARE) within four nucleotides on either side of the RET+3 site. However, no predicted binding sites are disrupted directly by the mutant RET+3:T allele or the alleles at the rs2506004 and rs2506005 sites. Importantly, retinoic acid has already been documented as a negative and a positive regulator of RET expression in cardiac and renal development, respectively^(20,21). Furthermore, exogenous retinoic acid delays hindgut colonization by RET-positive enteric neuroblasts and results in ectopic RET expression during embryogenesis²². Although the mutation(s) do not introduce or destroy a predicted RARE, they may introduce a novel site that permits competition with, or reduces access to, the neighboring predicted RAREs. Clearly, the ultimate proof of disease-causation will require the synthesis of the trait, from one or all three of the MCS+9.7 variants, in an appropriate model organism.

TABLE 4 Predicted transcription factor binding sites in MCS+9.7^(a)

Factor                 Start nucleotide^(b)   Length   Sequence
Sp1                    42866044               6        GGGGCC
RAR                    42866048               10       CCAGTGACCC
RORalpha1              42866051               13       GTGACCCTTACAT
NP-III                 42866051               6        GTGACC
AP-1                   42866052               6        TGACCC
RAR-alpha1             42866052               6        TGACCC
SRF_Q6                 42866054               14       ACCCTTACATGGTC
SAP-1                  42866056               10       CCTTACATGG
SRF                    42866056               10       CCTTACATGG
myc-CF1                42866060               6        ACATGG
RC2                    42866064               7        GGTCATC
RAR-alpha1             42866064               16       GGTCANNNNNNGGtCA
CACCC-binding factor   42866083               6        GGGTGG
Sp1                    42866083               6        GGGTGG
CP2                    42866088               7        GCCAGTC
LVa                    42866095               6        CTGTTC
NF-1                   42866101               6        AGCCAG
NF-1                   42866109               6        CTTGCC
NF-1                   42866117               7        AGGAAAG
SBF-1                  42866123               14       GAAATTAATTATAA
N-Oct-3                42866125               7        MATWAAT
MEF-2                  42866127               10       TTAATTATAA
TBP                    42866127               7        TTAATTA
RSRFC4                 42866128               8        TAWWWWTA
IF2                    42866136               10       ACCTAATTGG
CCAAT-binding factor   42866141               6        ATTGGC
NF-1/L                 42866142               6        TTGGCA
c-Ets-1_54             42866146               13       CAGTTTCCTTTGC
NFAT_Q6                42866146               12       CAGTTTCCTTTG
IBP-1                  42866146               11       CAGTTTCCTTT
PEA3                   42866149               6        TTTCCT
c-Ets-2                42866150               6        TTCCTT
Oct-1                  42866150               13       TTCCTTTGCATAG
Pit-1a                 42866155               7        TTGCATA
EFII                   42866156               6        TGCATA
Elk-1                  42866162               16       GAAGCCGGAAGCAACT
c-Myb                  42866173               6        CAACTG
Sp1                    42866184               9        KRGGCKRRK
GATA-1                 42866192               6        TGATTA
AP-1                   42866192               7        TGATTAA
Zen-2                  42866193               12       GATTAACTCTGC
Eve                    42866193               12       GATTAACTCTGC
HNF-1                  42866194               6        ATTAAC
ITF-2                  42866203               10       GCAGCAGCTG
Myf-5                  42866204               9        CAGCAGCTG
MyoD                   42866204               11       CAGCAGCTGGG
AP-4                   42866204               9        CAGCAGCTG
E2A                    42866206               7        GCAGCTG
Myogenin               42866206               7        GCAGCTG
RFX2                   42866207               6        CAGCTG
Tal-1                  42866207               6        CAGCTG
AP-4                   42866207               6        CAGCTG
XPF-1                  42866207               6        CAGCTG
C/EBPbeta              42866210               7        CTGGRAA
Ik-1                   42866211               6        TGGGAA
EFII                   42866217               6        ATTGCA
c-Myb                  42866221               6        CAGTTG
C/EBPalpha             42866223               6        GTTGGG
Ttk_88K                42866226               10       GGGCAGGAGC
Sp1                    42866226               6        GGGCAG
Myogenin               42866228               7        GCAGGAG
PEA3                   42866242               6        CATCCT
Adf-1                  42866251               16       CAGGCCGCTGCAGCTG
ITF-2                  42866257               10       GCTGCAGCTG

^(a)Based on TRANSFAC 4.0 predictions (http://www.cbil.upenn.edu/cgi-bin/tess) of p ≤ 10⁻² and La ≥ 10.
^(b)Specified positions are in reference to chromosome 10, build 34 (July 2003) of the human genome.

Based on its location, we predicted that the MCS+9.7 element functions as a transcriptional enhancer or suppressor. Using transient transfection assays, we tested the function of two RET intron 1 constructs in the mouse neuroblastoma cell line Neuro-2a. Amplicons containing MCS+9.7 and MCS+5.1/+9.7 show enhancer activity in this cell line (FIG. 3b), although this activity in HeLa cells is negligible (data not shown), suggesting that the activity of MCS+9.7 is cell-type dependent. Importantly, amplicons harbouring the mutant allele demonstrate significantly lower enhancer activity (6- to 8-fold decrease) than those containing the wild type allele (t-test, p value ≤ 0.001). These data suggest that the mutation lies within and compromises the activity of an enhancer-like sequence in RET intron 1. RET coding sequence mutations in HSCR are always loss-of-function alleles. Thus our finding that the RET+3 mutation decreases transcription is consistent with HSCR biology. We can localize the enhancer function, and the genetic change which diminishes that function, to the 900-nt fragment tested in the MCS+9.7 construct. Within this region exist three segregating sites (rs2506005, RET+3 and rs2506004) in complete LD. In principle, any one of these three sites, or their combination, can be the disease susceptibility factor.

World-Wide Distribution of MCS+9.7 Variants

The global distribution of the RET+3:T allele was determined by genotyping individuals from 51 unselected populations. The mutant T allele is virtually absent within Africa (<0.01), has intermediate frequency in Europe (0.25) but reaches high frequency (0.45) in Asia (FIG. 4). Additionally, we generated haplotypes for 7 SNPs from 60 individuals, each from Africa, Europe and Asia, derived from the above world-wide set and compared them to haplotypes from HSCR patients (Table 5). Haplotypes bearing the RET+3:T allele likely have a single origin, sometime after modern humans emerged from Africa. Intriguingly, the high frequency of the RET+3:T allele, and the susceptibility haplotype, in East Asia correlates with an increased incidence of short segment HSCR among Asian newborns (3.1 vs. 1.5 per 10,000 births in Asian American versus European American births in California between 1983 and 1997; C. Torfs, 1998; personal communication). This same haplotype has a 66% frequency among Chinese sporadic HSCR patients⁵; consequently, a 2-fold increase in the mutant allele frequency translates into a roughly 2-fold increase in disease incidence. We suspect that RET+3:T is a marker for short segment HSCR since the low frequency of the RET+3:T allele in Africa correlates with a lower frequency of short segment HSCR among African Americans².

Haplotype frequencies in Africa, Asia, Europe and HSCR cases.

60 individuals were selected from the HGDP samples representing each continent. HSCR: all available HSCR cases. Haplotypes were reconstructed using PHASE⁴¹. For each SNP, the HSCR-associated allele is highlighted in yellow. Position of RET+3 is indicated by the red box. — indicates the haplotype was not observed among the chromosomes genotyped.

These data strongly argue that among the three SNPs within MCS+9.7 only the RET+3 variant is the susceptibility mutation. The associated alleles at rs2506005, RET+3 and rs2506004 are the ancestral, derived and ancestral alleles, respectively. Given our knowledge of human evolution and that the susceptibility haplotype has 1% frequency in Africa, the ancestral haplotype (with ancestral alleles at each SNP) was virtually extinct within Africa until it rose in frequency with the occurrence of the RET+3:T mutation.

This finding of a common allele that rapidly increased in frequency but is associated with a disease predisposition can be explained in one of three ways: (1) recurrent mutations from the wild type to the same deleterious mutant; (2) chance increase by genetic drift; and (3) selective advantage of the mutation in heterozygotes. The finding of a common haplotype suggests that the first explanation is unlikely. To distinguish between the two remaining alternatives, we performed two analyses: (a) we estimated an F_(ST) value of 0.027; (b) we compared our world-wide mutant allele distribution (summarized as allele frequency <5% in Africa, >25% in Europe, and >40% in China/Japan) to that of 8,247 SNPs from the ENCODE loci²³. Only 38 sites (0.46%) show the observed or a more extreme pattern, strongly suggesting selective advantage to the mutation.
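The empirical comparison in (b) amounts to counting how many reference SNPs show a frequency pattern at least as extreme as the one observed. A minimal sketch follows, assuming each SNP is summarized as an (Africa, Europe, East Asia) allele-frequency triple; the function name and the toy frequencies are hypothetical, and only the thresholds and the 38 of 8,247 count come from the text.

```python
from typing import Iterable, Tuple


def fraction_matching(
    snps: Iterable[Tuple[float, float, float]],
    max_africa: float = 0.05,
    min_europe: float = 0.25,
    min_asia: float = 0.40,
) -> Tuple[int, float]:
    """Count SNPs with frequency < max_africa, > min_europe and > min_asia."""
    snps = list(snps)
    if not snps:
        raise ValueError("no SNPs supplied")
    hits = sum(
        1
        for africa, europe, asia in snps
        if africa < max_africa and europe > min_europe and asia > min_asia
    )
    return hits, hits / len(snps)


# Toy reference panel (hypothetical frequencies, not ENCODE data).
panel = [(0.01, 0.30, 0.50), (0.10, 0.30, 0.50), (0.00, 0.20, 0.45), (0.04, 0.26, 0.41)]
print(fraction_matching(panel))

# Check of the proportion reported in the text: 38 of 8,247 sites.
print(f"{38 / 8247:.2%}")
```

The smaller this fraction is in a panel of presumably neutral sites, the harder it is to attribute the observed pattern to drift alone.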

If polymorphisms make substantial contributions to common disorders then a significant fraction of them must have been exposed to selection. It is not surprising, then, that a majority of common disease associations involve alleles that provided (α-globin, β-globin²⁴, G6PD²⁵⁻²⁷, HLA²⁸, Fy²⁹ and other variants in malaria), or are suspected of providing (CCR5Δ32 in HIV infection³⁰⁻³²), a survival advantage to humans. Thus, many common variants in currently common disorders perhaps stem from alleles that were, or are, protective for another phenotype, providing mechanistic support to the common variant, common disease model of genetic disease^(33,34).

Prior to the advent of corrective surgical methodologies in the 1950s, HSCR was a uniformly fatal disorder, necessitating positively acting selective forces to maintain this deleterious allele at high frequency. Our demonstration that the RET+3:T allele is a derived allele that is virtually absent in Africa but rose to a frequency of 0.25 in Europe and 0.45 in Asia in 100,000 years or less is indicative of such a selective force. RET is a tyrosine kinase receptor on the surface of neuroblasts, and many other cell types, and it is not inconceivable that it might be a target of pathogen entry, such as the chemokine receptors involved in HIV and malaria.

Genetic Properties of the RET+3 Susceptibility Allele

A pervasive feature of HSCR is the marked gender difference in expression and incidence, with males being four times more likely to be affected than females. These sex differences could arise from mutations on the X chromosome, but genome-wide mapping studies^(1,7) have consistently failed to identify an X-linked gene. Consequently, we tested whether the RET+3 variant at MCS+9.7 shows sex-specific effects. As shown in Table 1, transmission frequency of the associated allele in the RET region is always smaller to affected daughters than to affected sons, with rare exceptions at non-significant SNPs. Indeed, given the lower female penetrance, there were fewer affected daughters than sons in our sample, and among them only the mutant SNP (boys: τ=0.86, p=3.7×10⁻¹¹; girls: τ=0.68, p=0.02) and the SNP at 1.1SfcI, 3.3 kb away, are statistically significantly different from 0.50. Nevertheless, a trend test for a difference in male and female offspring transmission frequency is highly significant and estimates the male-to-female transmission ratio to be ˜2 (p=0.0007). Thus, the genetic effect at MCS+9.7 is significantly greater in sons than in daughters.
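A single-SNP comparison of transmission to sons and daughters can be sketched as a likelihood ratio test of τ_(b)=τ_(g). The sketch below assumes independent binomial transmissions and uses the RET+3 counts from Table 1; it is not the trend test across SNPs described above, so its p value is not expected to reproduce the reported p=0.0007. All function names are illustrative.

```python
import math
from typing import Tuple


def binom_loglik(t: int, u: int, tau: float) -> float:
    """Log-likelihood of t transmissions and u non-transmissions at rate tau."""
    ll = 0.0
    if t:
        ll += t * math.log(tau)
    if u:
        ll += u * math.log(1.0 - tau)
    return ll


def sex_difference_lrt(t_b: int, u_b: int, t_g: int, u_g: int) -> Tuple[float, float, float, float]:
    """Return (tau_boys, tau_girls, LRT statistic, p value on 1 df)."""
    tau_b = t_b / (t_b + u_b)
    tau_g = t_g / (t_g + u_g)
    tau_0 = (t_b + t_g) / (t_b + u_b + t_g + u_g)  # common rate under the null
    ll_alt = binom_loglik(t_b, u_b, tau_b) + binom_loglik(t_g, u_g, tau_g)
    ll_null = binom_loglik(t_b, u_b, tau_0) + binom_loglik(t_g, u_g, tau_0)
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2.0))
    return tau_b, tau_g, stat, p


# RET+3 (rs2435357): sons T=73, U=12; daughters T=28, U=13 (Table 1).
tau_b, tau_g, stat, p = sex_difference_lrt(73, 12, 28, 13)
print(f"tau_b={tau_b:.2f} tau_g={tau_g:.2f} LRT={stat:.2f} p={p:.3f}")
```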

Two other features of the RET+3 mutation display sex differences consistent with the greater incidence in males than females. First, as shown in Table 2, the transmission frequency to affected sons and daughters leads to a 5.7-fold and 2.1-fold increase in susceptibility in males and in females, respectively, assuming a multiplicative model for penetrance. Second, genotype frequencies of affected individuals can be used to estimate the penetrance, which varies between 6.2×10⁻⁵ and 1.8×10⁻³ (Table 2) and is considerably smaller than that for long segment HSCR. Our finding of gender differences in penetrance is consistent with the greater incidence of HSCR in males. For all traits demonstrating gender-specific differences in incidence, affected individuals from the less frequently affected sex (females for HSCR) have a higher mean susceptibility. Therefore, when we consider the totality of all susceptibility loci, we expect females with HSCR to carry more susceptibility alleles than their male counterparts³⁵. It follows that the penetrance of any specific mutation must be lower for the lesser affected sex, as observed here.
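The link between transmission frequency and risk ratio under a multiplicative model can be written out as a short derivation. This is a standard textbook relation offered as a sketch; it assumes genotype risks of r, rγ and rγ² for 0, 1 and 2 copies of the risk allele, and it is not necessarily the exact estimation procedure used for Table 2.

```latex
% A heterozygous parent transmits the risk allele with prior probability 1/2.
% Conditioning on an affected child, whose risk is multiplied by \gamma
% when the risk allele is transmitted:
\[
\tau
  = \Pr(\text{risk allele transmitted} \mid \text{child affected})
  = \frac{\tfrac{1}{2}\,\gamma}{\tfrac{1}{2}\,\gamma + \tfrac{1}{2}}
  = \frac{\gamma}{1+\gamma},
\qquad\text{so}\qquad
\gamma = \frac{\tau}{1-\tau}.
\]
```

With τ=0.68 for daughters this gives γ≈2.1, in line with the value quoted above. With τ=0.86 for sons it gives γ≈6.1, somewhat above the quoted 5.7, so the published estimates presumably rest on a fuller likelihood model or a different set of families.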

To assess the genetic impact of this common mutation we estimated the proportion of the total variance in susceptibility that the RET+3 mutation explains. Surprisingly, only 2.63% and 1.14% of the variation is explained by the action of this mutation in males and females, respectively (Table 2). This is in contrast to the meagre 0.1% of the total variance in susceptibility explained by all known coding mutations at RET². Consequently, the MCS+9.7 enhancer mutation explains a 10- to 20-fold greater susceptibility variation than all other known RET mutations. However, our findings also caution that a considerable number of additional loci may remain to be identified.

TABLE 2 Genetic characteristics of the RET enhancer mutation

            Observed genotype counts†   Expected     Penetrance (×10⁵)§
Genotype    Males      Females          frequency‡   Males           Females
CC          40         15               0.58         16.1 ± 2.2      6.2 ± 0.9
CT          50         17               0.37         34.5 ± 3.8      6.4 ± 1.3
TT          37         26               0.06         175.0 ± 22.9    35.9 ± 8.0

Risk ratio (γ)#: males 5.7, females 2.1
Variation (%): males 2.63, females 1.14

A final interesting gender difference is that the mutant allele arises from mothers and fathers in 35 and 18 of the 53 informative families, respectively. This is significantly different from expectation (p=0.02) and similar to the effect we previously observed in linkage analysis of RET in a different series of families⁷. The cause of this bias is unknown since RET is not known to be imprinted; however, whether RET shows specific imprinting in neuroblasts is unknown.
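The parent-of-origin comparison can be checked with a test of 35 versus 18 against the 1:1 expectation. The text does not state which test produced p=0.02, so the sketch below uses an exact two-sided binomial test as one reasonable choice; the function name is illustrative.

```python
from math import comb


def binom_two_sided(k: int, n: int) -> float:
    """Exact two-sided binomial p value for k of n events when each side has probability 0.5."""
    k_ext = max(k, n - k)  # the more extreme of the two counts
    tail = sum(comb(n, i) for i in range(k_ext, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# 35 maternal versus 18 paternal origins among 53 informative families.
p = binom_two_sided(35, 53)
print(f"p = {p:.3f}")

# For comparison, the 1-df chi-square statistic (35 - 18)^2 / 53.
print(f"chi-square = {(35 - 18) ** 2 / 53:.2f}")
```

The exact test is slightly more conservative than the chi-square approximation, which is the usual behaviour at this sample size.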

The identification of the RET+3 mutation was aided by comparative sequence analysis and emphasized by its likely selection. This finding has several implications for genetic analyses of both Mendelian and complex disease. Mutation searches as described herein in human disease include both coding sequences of genes and neighboring non-coding elements. For example, non-coding mutations may conspire with mutations at additional genes for disease to occur; they may also underlie rare Mendelian phenotypes, where 10-15% of patients can have no recognized mutations despite incontrovertible evidence for a single known gene. Not all mutations for rare diseases are required to be rare or have 100% penetrance. Thus, the criterion of identifying mutations as sequence changes that are absent in controls may not be appropriate for a significant fraction of alterations and may exclude legitimate mutations. The inheritance patterns of single gene traits due to common variants are somewhat different from those we have come to expect from rare Mendelizing mutations, particularly when penetrance is not complete. Thus, apparent genetic heterogeneity in linkage or bilineal inheritance does not imply that mutations do not exist at a single locus.

A variety of non-coding elements are involved in transcription, translation, recombination, replication and repair, but the full nature and function of these sequences are unknown. Comparative genomics provides an avenue for recognizing such elements in a generic way but this depends on the assumption that functional sites evolve recognizably slower than non-functional sites. These analyses have shown that only 1.5% of the human genome is devoted to coding exons and as much as 3% to conserved non-coding sequences³⁶, implying that the latter may be particularly important as sites of mutation. Provided herein is a molecular view of a multifactorial disorder: the most common mutation is non-coding, it has low (marginal) penetrance, the mutation has sex-dependent effects and explains only a small fraction of the total susceptibility to HSCR. Nevertheless, examples provided herein have three features that are relevant to the analysis of common complex disorders. First, although the known protein coding HSCR mutations have higher (51-72%) penetrance, their rarity in the population implies they explain only a minute fraction (0.1%) of the disorder. Thus, additional genes or environmental factors may explain disease incidence. Second, about 11% of our HSCR patients have known RET coding mutations in addition to carrying the RET+3:T variant. It is not unlikely that coding and non-coding mutations may act synergistically to affect disease penetrance; in other words, there may be more than one mutation per gene. Third, an enhancer mutation allows us to speculate that additional factors (proteins) interact with this element and can mitigate or attenuate its genetic effect on RET transcription. In sum, for common mutations, we expect that mutation penetrance will depend on other alleles and genes (genetic background), epigenetic effects (such as those associated with sex-linked gene dosage), or even the environment.

Patient Samples.

We genotyped trios with 126 probands, all their parents (of which 3 were affected) plus 24 unaffected siblings; for the penetrance studies we also genotyped additional probands for a total of 450 samples. All forms of HSCR (short segment, long segment, and total colonic aganglionosis) were represented in the patient sample. Eleven percent of ascertained cases presented with additional anomalies, including defined neurocristopathies, chromosomal abnormalities (e.g., trisomy 21), and other defects. Ascertainment was conducted under informed consent approved by the Institutional Review Board of Johns Hopkins University School of Medicine. In addition to the HSCR patients and their families, we also genotyped 1,064 samples representing individuals from six continents from the CEPH Human Genome Diversity Panel (http://www.cephb.fr/HGDP-CEPH-Panel/)³⁷.

SNP Genotyping.

We selected SNPs with a minimum minor allele frequency of 10%, with physical map locations covering the three genes RET, GALNACT-2, and RASGEF1A and emphasizing the associated region within RET⁸. From dbSNP, we selected SNPs with known heterozygosity and/or SNPs with both alleles observed twice (“double hit” SNPs); we used markers for which robust genotyping assays could be developed. All SNPs are referred to by their rs numbers. Genotypes were generated using the fluorogenic 5′ nuclease assay (Taqman, Applied Biosystems, Foster City, Calif.). A TECAN Genesis workstation was used for all liquid handling, thermal cycling was completed on MJ Research Tetrads, and end-point measurements were made on an ABI 7900. Genotypes were determined using SDS 2.1 (Applied Biosystems, Foster City, Calif.) and verified by the instrument operator. Ten percent of the samples (n=45) were genotyped in duplicate for all 30 markers; no discrepancies were observed among the 1,350 paired replicate genotypes.

Transmission Disequilibrium Test.

The TDT chi square test statistic was used to identify significant deviation from the expected 1:1 Mendelian transmission¹¹. The transmission frequency (τ) from heterozygous parents to offspring was estimated from all family genotype data at each SNP by maximum likelihood. We assumed either (i) a single τ, (ii) τ differing by parent gender (τ_(m), τ_(f)), or (iii) different transmission rates to male (b) and female (g) children (τ_(b), τ_(g)). Chi square tests with 1 degree of freedom based on the appropriate likelihood ratio were used to test whether τ=½, τ_(m)=τ_(f) or τ_(b)=τ_(g).
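
The single-τ case can be illustrated with a minimal Python sketch. It assumes the only inputs are the counts of heterozygous parents who transmitted, or did not transmit, a given allele; the function names are hypothetical and are not taken from the software used in the study. It shows the classical TDT chi square statistic, the likelihood ratio statistic for τ=½ with τ estimated as the observed transmitted fraction, and the 1-degree-of-freedom p value.

```python
import math


def tdt_chi_square(transmitted, untransmitted):
    """Classical TDT statistic (b - c)^2 / (b + c) for one allele."""
    n = transmitted + untransmitted
    if n == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / n


def tdt_likelihood_ratio(transmitted, untransmitted):
    """Likelihood ratio statistic for tau = 1/2 against the ML estimate b/n."""
    n = transmitted + untransmitted
    if n == 0:
        raise ValueError("no informative (heterozygous) parents")

    def term(k):
        # k * ln(2k/n), with the 0 * ln(0) limit taken as 0
        return k * math.log(2.0 * k / n) if k > 0 else 0.0

    return 2.0 * (term(transmitted) + term(untransmitted))


def chi_square_1df_p_value(statistic):
    """Upper tail probability of a chi square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    b, c = 60, 40  # illustrative counts only, not data from the study
    print(tdt_chi_square(b, c), tdt_likelihood_ratio(b, c))
    print(chi_square_1df_p_value(tdt_chi_square(b, c)))
```

The parent-gender and child-gender comparisons would follow the same pattern, with separate counts per group and a likelihood ratio comparing the two-τ fit with the single-τ fit.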

Haplotype Reconstruction and Exhaustive Allelic TDT (EATDT).

For family based samples, haplotypes were inferred using hap2, a method that combines traditional family-based reconstructions with population-based linkage disequilibrium information to achieve extremely accurate reconstruction within nuclear families¹². Haplotypes for control HGDP individuals were reconstructed with PHASE³⁸. Exhaustive allelic transmission disequilibrium tests (EATDT) were performed, following haplotype reconstruction, for all sliding windows of all numbers of SNPs at all positions¹³. Within each window of any size, all observed haplotypes were tested for association by the TDT. To assess overall significance, while accounting for multiple tests, 10⁸ permutations were performed to estimate a p value.
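
The windowing and permutation logic can be sketched as follows. This is a simplified illustration, not the published EATDT implementation: it assumes phased haplotypes are already available as equal-length allele strings, represents each parent as a hypothetical (transmitted, untransmitted) pair, and permutes by randomly swapping the two haplotypes of each parent. For n SNPs there are n(n+1)/2 contiguous windows.

```python
import random
from collections import Counter


def max_window_tdt(pairs):
    """Largest TDT chi square over all contiguous windows and all haplotypes.

    pairs: list of (transmitted, untransmitted) haplotype strings per parent.
    """
    if not pairs:
        return 0.0
    n_snps = len(pairs[0][0])
    best = 0.0
    for start in range(n_snps):
        for end in range(start + 1, n_snps + 1):
            trans_counts, untrans_counts = Counter(), Counter()
            for trans, untrans in pairs:
                a, b = trans[start:end], untrans[start:end]
                if a != b:  # parent is informative in this window
                    trans_counts[a] += 1
                    untrans_counts[b] += 1
            for hap in set(trans_counts) | set(untrans_counts):
                t, u = trans_counts[hap], untrans_counts[hap]
                best = max(best, (t - u) ** 2 / (t + u))
    return best


def permutation_p_value(pairs, n_perm=1000, seed=0):
    """Empirical p value for the maximum statistic under random transmission."""
    rng = random.Random(seed)
    observed = max_window_tdt(pairs)
    hits = 0
    for _ in range(n_perm):
        shuffled = [(u, t) if rng.random() < 0.5 else (t, u) for t, u in pairs]
        if max_window_tdt(shuffled) >= observed:
            hits += 1
    # add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    toy = [("AB", "ab")] * 10  # illustrative haplotypes only
    print(max_window_tdt(toy), permutation_p_value(toy, n_perm=200))
```

Taking the maximum statistic within each permutation is what lets a single empirical p value account for the many correlated window and haplotype tests.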

Re-Sequencing.

Three re-sequencing experiments were performed and analyzed to identify novel SNPs: (1) DNA chip-based re-sequencing³⁹ of the non-repeat sequence in a 90-kb interval containing RET in 32 Mennonites (15 HSCR cases and 17 controls); (2) re-sequencing MCSs within RET intron 1 in 22 HSCR patients from families with RET-linkage but no identified coding sequence mutations; (3) re-sequencing 9 kb around RET+3 in 4 and 8 individuals each homozygous for the RET+3:T and the RET+3:C allele, respectively. These analyses identified numerous rare and novel SNPs, additional low frequency SNPs existing in dbSNP, and a high frequency SNP within intron 1 enriched in patients, RET+3. In addition to RET+3, we identified variants within three additional intron 1 conserved elements (see later) by re-sequencing in HSCR patients.

Allele Distribution at ENCODE Loci.

The ENCODE project²³ has identified all segregating sites at 5 loci on human chromosomes 2p16.3, 2q37.1, 4q26, 7q21.13 and 7q31.33, each ~500 kb in length. All SNPs were genotyped in the HapMap samples from four populations, namely, Utah CEPH, Yoruba from Ibadan, Nigeria, Han Chinese from Beijing and Japanese from Tokyo, Japan (www.HapMap.org). We estimated allele frequencies at 8,247 SNPs in the three continental regions (Europe, Africa and Asia; 60 independent individuals each) and compared them to the RET+3:T allele. We estimated the probability of observing allele frequency <5% in Yoruba, >25% in Europe, and >40% in China/Japan in all 8,247 SNPs as 0.0046. To reduce effects of LD, we sampled every second (4,121 SNPs), fourth (2,059 SNPs), eighth (1,028 SNPs) and sixteenth (512 SNPs) SNP to obtain probabilities of 0.0036, 0.0049, 0.0068 and 0.0059, respectively. An identical analysis using the F_(ST) statistic gave a p-value of 0.027 (0.023-0.029).
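
The empirical probability described above amounts to counting the SNPs whose allele frequencies are at least as extreme as the stated thresholds in all three regions, optionally after thinning the SNP list to reduce LD. A minimal Python sketch follows; the input layout (one tuple of Yoruba, European and Asian frequencies per SNP) and the function name are assumptions made for illustration.

```python
def fraction_with_ret3_like_pattern(freqs, step=1):
    """Fraction of SNPs with frequency <5% in Yoruba, >25% in Europe, >40% in Asia.

    freqs: list of (yoruba, europe, asia) allele frequencies, in map order.
    step:  keep every step-th SNP (1 = all SNPs) to reduce the effect of LD.
    """
    sampled = freqs[::step]
    if not sampled:
        raise ValueError("no SNPs to evaluate")
    hits = sum(
        1
        for yoruba, europe, asia in sampled
        if yoruba < 0.05 and europe > 0.25 and asia > 0.40
    )
    return hits / len(sampled)


if __name__ == "__main__":
    # Illustrative frequencies only, not ENCODE data
    toy = [(0.01, 0.30, 0.50), (0.20, 0.30, 0.50),
           (0.00, 0.26, 0.41), (0.04, 0.10, 0.90)]
    for step in (1, 2):
        print(step, fraction_with_ret3_like_pattern(toy, step))
```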

Estimating the Susceptibility Variance Due to a Polymorphism.

We assume that the variation in susceptibility to HSCR is multifactorial and parametrized as described in¹³. The three genotypes at the susceptibility locus are AA, Aa and aa with frequencies p², 2pq, q², respectively; means of 0, dt and t, respectively (t=displacement; d=degree of dominance); residual variance of 1 arising from additional genes and the environment. Genotype-specific susceptibility distributions are Gaussian, and all measurements on the susceptibility scale are in standard deviation units. Affection arises whenever the susceptibility exceeds a biological threshold Z so that genotype-specific penetrance is the integrated Gaussian density above Z.

Penetrance of the CC, CT and TT genotypes at RET+3 (C=wild type; T=mutant) can be estimated using inverse probability given the observed numbers of affecteds with these genotypes, assuming a disease incidence (S-HSCR and L-HSCR are 80% and 20% of the total incidence of 1/5,000) and the mutant allele frequency (q=0.24 from the untransmitted chromosomes in 252 parents of probands). Consequently, we can estimate Z from the CC penetrance, and given the threshold we can estimate the susceptibility means from the two other genotype distributions; estimation was by the maximum likelihood method. Finally, the variance in susceptibility between genotypes can be calculated from the three estimated means.
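
The chain of calculations in this threshold model can be sketched in Python. The sketch is a simplification: it inverts the Gaussian tail directly rather than using maximum likelihood, it assumes Hardy-Weinberg genotype frequencies, and the case genotype proportions in the usage example are placeholders rather than study data. All function names are hypothetical.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # unit residual variance on the liability scale


def hardy_weinberg(q):
    """Genotype frequencies for mutant allele frequency q (C = wild type, T = mutant)."""
    p = 1.0 - q
    return {"CC": p * p, "CT": 2.0 * p * q, "TT": q * q}


def penetrance_by_inverse_probability(case_freqs, population_freqs, incidence):
    """P(affected | genotype) = P(genotype | affected) * incidence / P(genotype)."""
    return {g: case_freqs[g] * incidence / population_freqs[g] for g in case_freqs}


def threshold_and_means(penetrance):
    """Threshold Z from the CC penetrance, then the mean liability of each genotype."""
    z = _STANDARD_NORMAL.inv_cdf(1.0 - penetrance["CC"])
    means = {g: z - _STANDARD_NORMAL.inv_cdf(1.0 - pen) for g, pen in penetrance.items()}
    return z, means


def between_genotype_variance(means, population_freqs):
    """Variance of the genotype means, weighted by genotype frequency."""
    grand_mean = sum(population_freqs[g] * means[g] for g in means)
    return sum(population_freqs[g] * (means[g] - grand_mean) ** 2 for g in means)


if __name__ == "__main__":
    population = hardy_weinberg(0.24)
    cases = {"CC": 0.20, "CT": 0.35, "TT": 0.45}  # placeholder proportions
    pen = penetrance_by_inverse_probability(cases, population, 1.0 / 5000.0)
    z, means = threshold_and_means(pen)
    print(z, means, between_genotype_variance(means, population))
```

Because the residual variance is fixed at 1, the between-genotype variance V from this calculation corresponds to a fraction V/(1+V) of the total susceptibility variance.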

Multi-Species Genomic Sequences.

Genomic sequences orthologous to a 350-kb region encompassing the RET gene were generated from multiple species. Publicly available genomic sequence data were used for human and mouse (Hg16, chr10:42700000-43050000 (human) and Mm3, chr6:118646816-119036816 (mouse)). Bacterial artificial chromosome (BAC) clones from seven non-human vertebrates (chimpanzee, baboon, cow, pig, cat, dog, and rat) were isolated by screening BAC libraries with ‘universal’ hybridization probes⁴³. For non-mammalian organisms (chicken, zebrafish, fugu, and tetraodon), species-specific probes were designed from available gene sequence. Following mapping, selected BACs were sequenced by the NISC Comparative Sequencing Program. Additionally, orthologous chicken sequences were obtained from the whole-genome assembly available at http://genome.ucsc.edu.

Comparative Sequence Analysis.

Sequences were aligned and visualized with mVISTA^(18,44) and MultiPipMaker⁴⁵. Multi-species conserved sequences (MCSs) were identified with the algorithm of Margulies et al. (2003). Briefly, this method utilizes multiple alignments (MultiPipMaker) and calculates conservation scores for 25-nt overlapping windows with 1-nt increments. We used 5% of the reference sequence as the appropriate cut-off for conserved sequence identification¹⁹, as 5% of the human genome is presumed to be under natural selection³⁹. We considered the overlapping set of mVISTA:MCS elements because MCSs alone can fragment known functional units (e.g., exons) into multiple smaller fragments. For mVISTA analysis, we chose a pairwise comparison between mouse and human. Importantly, all elements identified in comparisons between human and any other vertebrate were represented in the mouse-human comparison, suggesting that this pairwise comparison is fully representative of the conserved elements in the region. MCSs included >98.9% of all nucleotides within these exons and less than 0.59% of ancient repeat sequence in the region. The summed length of all identified MCSs was 19.8 kb.
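
The windowing step can be illustrated with a short Python sketch. It is not the Margulies et al. scoring scheme: as a stand-in, the conservation score of a window is simply the fraction of its alignment columns that are identical across all species, and the highest-scoring windows are accepted until 5% of the reference bases are covered. The function names and the gap-free alignment input are assumptions for illustration.

```python
def identity_columns(reference, others):
    """1 where every aligned sequence matches the reference base, else 0."""
    return [
        1 if all(seq[i] == base for seq in others) else 0
        for i, base in enumerate(reference)
    ]


def window_scores(columns, window=25):
    """Score every window of the given width, advancing 1 nt at a time."""
    n = len(columns)
    if n < window:
        return []
    total = sum(columns[:window])
    scores = [total / window]
    for i in range(window, n):
        total += columns[i] - columns[i - window]
        scores.append(total / window)
    return scores


def conserved_mask(columns, window=25, top_fraction=0.05):
    """Mark reference bases in the best windows until top_fraction is covered."""
    n = len(columns)
    mask = [False] * n
    scores = window_scores(columns, window)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    target = top_fraction * n
    covered = 0
    for start in ranked:
        if covered >= target:
            break
        for j in range(start, start + window):
            if not mask[j]:
                mask[j] = True
                covered += 1
    return mask


if __name__ == "__main__":
    toy = [0] * 100 + [1] * 25 + [0] * 375  # one perfectly conserved 25-nt block
    mask = conserved_mask(toy)
    print(sum(mask), mask.index(True))
```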

MCSs identified all exons encoding RET, GALNACT-2 and RASGEF1A. No additional genes were identified 5′ to RET in the region we obtained and sequenced. The human genome sequence (http://genome.ucsc.edu: build 35) predicts that the gene most proximal to the 5′ end of RET, BMS1L, a putative ribosome biogenesis protein, lies 246 kb upstream of RET exon 1.

Expression Analysis.

Temporal and spatial expression patterns of RET, GALNACT-2, and RASGEF1A were established by reverse transcriptase-polymerase chain reaction (RT-PCR) and northern blotting. Human total RNA samples were from the Clontech™ (Palo Alto, Calif.) MTC human RNA panels. Embryonic and post-natal mouse RNAs were isolated from timed matings between 129SvImJ mice. All animal studies were conducted under protocols approved by the Johns Hopkins University Animal Care and Use Committee. All primer and probe sequences used in this study are available at http://chakravarti.igm.jhmi.edu/pro_site/projects/RET_Nature2005.

Luciferase Assays.

DNA samples from individuals homozygous for the T and C alleles at RET+3 were amplified, sequenced to verify their composition, and cloned into the Gateway pDONR™221 entry vector per the manufacturer's protocol. Amplicons were subcloned into a Sma I site in a Gateway® modified pGL3 (Promega™, Madison, Wis.) firefly luciferase vector containing an SV40 promoter and complete firefly luciferase open reading frame. Plasmids containing only the SV40 promoter and luciferase reporter (pDSma_promoter) and plasmids without the SV40 promoter (pDSma_control) served as experimental control vectors.

The neuroblastoma cell line (Neuro-2a, ATCC# CCL-131) was cultured according to ATCC protocols. Neuro-2a cells derive from a peripheral neuronal population that expresses the products of several HSCR genes (Ret, Ednrb, and Sox10), the neural crest-specific p75^(NTR) gene, and the neuronal marker Dbh (data not shown). Approximately 1×10⁶ Neuro-2a cells were co-transfected (Lipofectamine Plus™, Invitrogen, Carlsbad, Calif.) with 0.4 μg of the appropriate pDSma firefly luciferase plasmid and 0.01 μg phRL-SV40 control Renilla luciferase plasmid; the Renilla luciferase control plasmid was used to normalize all data points. Dual Luciferase® Assays (Promega, Madison, Wis.) were performed 24 hours after transfection according to manufacturer's protocols (Monolight® 2010, Analytical Luminescence Laboratories, CA). Fold change was calculated relative to samples transfected with the promoter-only construct (pDSma_promoter). Statistical significance was determined using a 2-tailed t-test assuming unequal variances.
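
The normalization and comparison described above can be sketched in Python. The sketch assumes replicate firefly and Renilla readings per construct and uses hypothetical function names; it returns the Welch (unequal variance) t statistic and its approximate degrees of freedom, leaving the 2-tailed p value to be read from a t distribution.

```python
import math
from statistics import mean, variance


def normalized(firefly, renilla):
    """Firefly signal divided by the co-transfected Renilla control, per replicate."""
    return [f / r for f, r in zip(firefly, renilla)]


def fold_change(test_ratios, promoter_only_ratios):
    """Mean normalized signal relative to the promoter-only construct."""
    return mean(test_ratios) / mean(promoter_only_ratios)


def welch_t(a, b):
    """t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Illustrative readings only, not data from the study
    test = normalized([200.0, 420.0, 630.0], [100.0, 105.0, 105.0])
    control = normalized([100.0, 210.0, 300.0], [100.0, 105.0, 100.0])
    print(fold_change(test, control), welch_t(test, control))
```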

Accession numbers for genomic sequences reported in this paper: Hg16, chr10:42700000-43050000 (human); Mm3, chr6:118646816-119036816 (mouse); AC125509 and AC125512 (baboon), AC124166 (cat), AC138567 (chicken), RP43-171H18 (chimpanzee), AC124163 and AC124164 (cow), AC123973 (dog), AC124911 and AC125500 (fugu), AC122156 and AC124165 (pig), AC114881 (rat), AC135546 (tetraodon), and AC124155 (zebrafish).

REFERENCES

-   1. Bolk, S. et al. A human model for multigenic inheritance: phenotypic expression in Hirschsprung disease requires both the RET gene and a new 9q31 locus. Proc Natl Acad Sci USA 97, 268-73 (2000).
-   2. Chakravarti, A. & Lyonnet, S. Hirschsprung disease (eds. Scriver, C. R. et al.) (McGraw-Hill, New York, 2001).
-   3. Carrasquillo, M. M. et al. Genome-wide association study and mouse model identify interaction between RET and EDNRB pathways in Hirschsprung disease. Nat Genet 32, 237-44 (2002).
-   4. Borrego, S. et al. RET genotypes comprising specific haplotypes of polymorphic variants predispose to isolated Hirschsprung disease. J Med Genet 37, 572-8 (2000).
-   5. Garcia-Barcelo, M. M. et al. Chinese patients with sporadic Hirschsprung's disease are predominantly represented by a single RET haplotype. J Med Genet 40, e122 (2003).
-   6. Sancandi, M. et al. Single nucleotide polymorphic alleles in the 5′ region of the RET proto-oncogene define a risk haplotype in Hirschsprung's disease. J Med Genet 40, 714-8 (2003).
-   7. Gabriel, S. B. et al. Segregation at three loci explains familial and population risk in Hirschsprung disease. Nat Genet 31, 89-93 (2002).
-   8. McCallion, A. S. et al. Genomic Variation in Multigenic Traits: Hirschsprung Disease (ed. Stillman, B.) (CSHL Press, Cold Spring Harbor, 2003).
-   9. Uyama, T. et al. Molecular cloning and expression of a second chondroitin N-acetylgalactosaminyltransferase involved in the initiation and elongation of chondroitin/dermatan sulfate. J Biol Chem 278, 3072-8 (2003).
-   10. Sato, T. et al. Molecular cloning and characterization of a novel human beta 1,4-N-acetylgalactosaminyltransferase, beta 4GalNAc-T3, responsible for the synthesis of N,N′-diacetyllactosediamine, galNAc beta 1-4GlcNAc. J Biol Chem 278, 47534-44 (2003).
-   11. Spielman, R. S., McGinnis, R. E. & Ewens, W. J. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52, 506-16 (1993).
-   12. Lin, S., Chakravarti, A. & Cutler, D. J. Haplotype and Missing Data Inference in Nuclear Families. Genome Res in press (2004).
-   13. Lin, S., Chakravarti, A. & Cutler, D. J. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet 36, 1181-8 (2004).
-   14. Loots, G. G. et al. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136-40 (2000).
-   15. Bray, N., Dubchak, I. & Pachter, L. AVID: A global alignment program. Genome Res 13, 97-102 (2003).
-   16. Margulies, E. H., Blanchette, M., Haussler, D. & Green, E. D. Identification and characterization of multi-species conserved sequences. Genome Res 13, 2507-18 (2003).
-   17. Shepherd, I. T., Pietsch, J., Elworthy, S., Kelsh, R. N. & Raible, D. W. Roles for GFRalpha1 receptors in zebrafish enteric nervous system development. Development 131, 241-9 (2004).
-   18. Shepherd, I. T., Beattie, C. E. & Raible, D. W. Functional analysis of zebrafish GDNF. Dev Biol 231, 420-35 (2001).
-   19. Rivas, E. & Eddy, S. R. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2, 8 (2001).
-   20. Shoba, T., Dheen, S. T. & Tay, S. S. Retinoic acid influences the expression of the neuronal regulatory genes Mash-1 and c-ret in the developing rat heart. Neurosci Lett 318, 129-32 (2002).
-   21. Batourina, E. et al. Vitamin A controls epithelial/mesenchymal interactions through Ret expression. Nat Genet 27, 74-8 (2001).
-   22. Pitera, J. E., Smith, V. V., Woolf, A. S. & Milla, P. J. Embryonic gut anomalies in a mouse model of retinoic Acid-induced caudal regression syndrome: delayed gut looping, rudimentary cecum, and anorectal anomalies. Am J Pathol 159, 2321-9 (2001).
-   23. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636-40 (2004).
-   24. Haldane, J. B. S. The rate of mutation of human genes. Hereditas 35(Suppl), 267-273 (1948).
-   25. Allison, A. C. G-6-PD deficiency in red blood cells of East Africans. Nature 186, 531 (1960).
-   26. Allison, A. C. & Clyde, D. F. Malaria in African children with deficient erythrocyte glucose-6-phosphate dehydrogenase. Br Med J 5236, 1346-9 (1961).
-   27. Motulsky, A. Metabolic polymorphisms and the role of infectious disease in human evolution. Human Biology 32, 28 (1960).
-   28. Hill, A. V. et al. Common west African HLA antigens are associated with protection from severe malaria. Nature 352, 595-600 (1991).
-   29. Miller, L. H., Mason, S. J., Clyde, D. F. & McGinniss, M. H. The resistance factor to Plasmodium vivax in blacks. The Duffy-blood-group genotype, FyFy. N Engl J Med 295, 302-4 (1976).
-   30. Samson, M. et al. Resistance to HIV-1 infection in caucasian individuals bearing mutant alleles of the CCR-5 chemokine receptor gene. Nature 382, 722-5 (1996).
-   31. Dean, M. et al. Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science 273, 1856-62 (1996).
-   32. Huang, Y. et al. The role of a mutant CCR5 allele in HIV-1 transmission and disease progression. Nat Med 2, 1240-3 (1996).
-   33. Collins, F. S. et al. New goals for the U.S. Human Genome Project: 1998-2003. Science 282, 682-9 (1998).
-   34. Lander, E. S. The new genomics: global views of biology. Science 274, 536-9 (1996).
-   35. Falconer, D. S. The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann Hum Genet 31, 1-20 (1967).
-   36. Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-62 (2002).
-   37. Cann, H. M. et al. A human genome diversity cell line panel. Science 296, 261-2 (2002).
-   38. Stephens, M., Smith, N. J. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68, 978-89 (2001).
-   39. Cutler, D. J. et al. High-throughput variation detection and genotyping using microarrays. Genome Res 11, 1913-25 (2001).
-   40. Thomas, J. W. et al. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res 12, 1277-85 (2002).
-   41. Thomas, J. W. et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788-93 (2003).
-   42. Dubchak, I. et al. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res 10, 1304-6 (2000).
-   43. Schwartz, S. et al. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res 10, 577-86 (2000).
-   43. Thomas, J. W. et al. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res. 12, 1277-1285 (2002).
-   44. Dubchak, I. et al. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10, 1304-1306 (2000).
-   45. Schwartz, S. et al. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10, 577-586 (2000).

What is claimed is:
 1. A method of identifying a mutation in DNA, comprising: predicting a genetic interval for a disease; comparing orthologous sequences to refine a putative functional interval; and sequencing the putative functional interval in subjects to identify mutations.
 2. The method of claim 1, further comprising classifying the refined interval into one or more of coding, non-coding, functional and non-functional sequences.
 3. The method of claim 2, wherein the further comparing is after comparing orthologous sequences.
 4. The method of claim 1, wherein the predicting comprises one or more of transmission disequilibrium tests (TDT), linkage, or association studies.
 5. The method of claim 1, wherein the subjects comprise individuals from affected families.
 6. The method of claim 1, wherein the subjects comprise affected and unaffected individuals.
 7. The method of claim 6, wherein mutations are over-represented in affected subjects as compared to normal subjects.
 8. The method of claim 1, wherein the mutation is associated with a multigenic disease.
 9. The method of claim 8, wherein the multigenic disease comprises one or more of mental illness, cancer, cardiovascular disease, congenital anomalies, metabolic disorder including but not limited to diabetes, susceptibility to infection, drug response, or drug tolerance.
 10. The method of claim 1, wherein the mutation comprises a variant of RET.
 11. The method of claim 10, wherein the RET variant comprises RET+3:T.
 12. The method of claim 1, wherein the mutations are one or more of associated with a disease susceptibility, causative of disease, or contributory to disease.
 13. The method of claim 1, wherein the mutation comprises a single nucleotide polymorphism, a multi-nucleotide polymorphism, an insertion, a deletion, a repeat expansion, genomic rearrangements, or segmental amplification.
 14. The method of claim 1, wherein the orthologous sequences comprise vertebrate sequences.
 15. The method of claim 14, wherein the vertebrate sequences comprise mammalian, reptilian, avian, amphibian, or osteichthyan sequences.
 16. The method of claim 1, wherein at least two orthologous sequences are compared to refine the interval.
 17. The method of claim 1, wherein the interval is refined by at least 20 fold.
 18. The method of claim 1, wherein the interval is refined by about 10 fold.
 19. The method of claim 1, wherein the interval is refined by about 5 fold.
 20. A method of identifying a diagnostic marker for a disease, comprising: predicting a genetic interval for a disease; comparing orthologous sequences to refine the interval; and sequencing the refined interval in affected and unaffected subjects to thereby identify a diagnostic marker associated with disease susceptibility, wherein the marker is over-represented in affected subjects compared to unaffected subjects.
 21. The method of claim 20, further comprising classifying the refined interval into one or more of coding, non-coding, functional and non-functional sequences.
 22. The method of claim 21, wherein the further comparing is after comparing orthologous sequences.
 23. The method of claim 20, wherein the predicting comprises one or more of transmission disequilibrium tests (TDTs), linkage, or association studies.
 24. The method of claim 20, wherein the subjects comprise affected and unaffected individuals.
 25. The method of claim 24, wherein mutations are over-represented in affected subjects as compared to normal subjects.
 26. The method of claim 20, wherein the mutation is associated with a multigenic disease.
 27. The method of claim 26, wherein the multigenic disease comprises one or more of mental illness, cancer, cardiovascular disease, congenital anomalies, metabolic disorder including but not limited to diabetes, susceptibility to infection, drug response, or drug tolerance.
 28. The method of claim 20, wherein the mutations are one or more of associated with a disease susceptibility, causative of disease, or contributory to disease.
 29. The method of claim 20, wherein the mutation comprises a single nucleotide polymorphism, a multi-nucleotide polymorphism, an insertion, a deletion, a repeat expansion, genomic rearrangements, or segmental amplification.
 30. The method of claim 29, wherein the orthologous sequences comprise vertebrate sequences.
 31. The method of claim 30, wherein the vertebrate sequences comprise mammalian, reptilian, avian, amphibian, or osteichthyan sequences.
 32. The method of claim 20, wherein at least two orthologous sequences are compared to refine the interval.
 33. The method of claim 20, wherein the interval is refined by at least 20 fold.
 34. The method of claim 20, wherein the interval is refined by about 10 fold.
 35. The method of claim 20, wherein the interval is refined by about 5 fold.
 36. The method of claim 20, further comprising characterizing the marker.
 37. The method of claim 36, wherein characterizing comprises one or more of expression analysis, promoter analysis, regulatory element analysis, knock-out analysis, or knock-down analysis.
 38. The method of claim 37, wherein one or more of the analyses are done with a transgenic animal or a cell line.
 39. A method of identifying a subject having Hirschsprung disease risk comprising detecting in the subject a mutation in the receptor tyrosine kinase RET, wherein a RET+3:T allele is associated with disease risk.
 40. The method of claim 39, wherein the subject is a member of an affected family.
 41. The method of claim 39, wherein RET is a marker for short segment HSCR.
 42. A kit for detecting the presence of HSCR comprising: primers amplifying the mutation and instructions for use.